Releases: spider-rs/spider
v1.91.1
What's Changed
The AI results now return the input (prompt), js_output, and content_output.
Full Changelog: v1.90.0...v1.91.1
v1.88.7
What's Changed
You can now drive the browser with multiple steps toward a goal (see the example below). Extracting content or gathering extra data can be done as well using GPTConfigs.extra_ai_data.
The credits used can be checked with Page.openai_credits_used.
- chore(page): return all page content regardless of status
- chore(openai): fix svg removal
- feat(openai): add extra data gpt curating
- chore(openai): add credits used response
- feat(fingerprint): add fingerprint id configuration
use spider::configuration::GPTConfigs;
use spider::tokio;
use spider::website::Website;

#[tokio::main]
async fn main() {
    let gpt_config: GPTConfigs = GPTConfigs::new_multi(
        "gpt-4-1106-preview",
        vec![
            "Search for Movies",
            "Click on the first result movie result",
        ],
        500,
    );

    let mut website: Website = Website::new("https://www.google.com")
        .with_openai(Some(gpt_config))
        .with_limit(1)
        .build()
        .unwrap();

    website.crawl().await;
}
Full Changelog: v1.87.3...v1.88.7
v1.87.3
What's Changed
You can now bypass Cloudflare-protected pages with the feature flag real_browser.
- feat(real_browser): add real_browser feature flag for chrome
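A sketch of enabling the flag in Cargo.toml — the feature name comes from the note above; the version pin and the co-enabled chrome feature are assumptions:

```toml
[dependencies]
spider = { version = "1.87", features = ["chrome", "real_browser"] }
```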
Full Changelog: v1.86.16...v1.87.3
v1.86.16
What's Changed
You can now dynamically drive the browser with custom scripts using OpenAI.
Make sure to set the OPENAI_API_KEY env variable or pass it to the program.
extern crate spider;

use std::time::Duration;

use spider::configuration::{GPTConfigs, WaitForIdleNetwork};
use spider::tokio;
use spider::website::Website;

#[tokio::main]
async fn main() {
    let _ = tokio::fs::create_dir_all("./storage/").await;

    let screenshot_params =
        spider::configuration::ScreenshotParams::new(Default::default(), Some(true), Some(true));
    let screenshot_config =
        spider::configuration::ScreenShotConfig::new(screenshot_params, true, true, None);

    let mut website: Website = Website::new("https://google.com")
        .with_chrome_intercept(true, true)
        .with_wait_for_idle_network(Some(WaitForIdleNetwork::new(Some(Duration::from_secs(30)))))
        .with_screenshot(Some(screenshot_config))
        .with_limit(1)
        .with_openai(Some(GPTConfigs::new(
            "gpt-4-1106-preview",
            "Search for Movies",
            500,
        )))
        .build()
        .unwrap();

    let mut rx2 = website.subscribe(16).unwrap();

    tokio::spawn(async move {
        while let Ok(page) = rx2.recv().await {
            println!("{}\n{}", page.get_url(), page.get_html());
        }
    });

    website.crawl().await;
}
The output of the custom script from the AI:
The screenshot of the page output:
Full Changelog: v1.85.4...v1.86.16
v1.85.4
What's Changed
You can now update the crawl links from outside the crawl context by using website.queue to get a sender.
- feat(q): add mid crawl queue
- chore(chrome): fix semaphore limiting scrape
use spider::tokio;
use spider::url::Url;
use spider::website::Website;

#[tokio::main]
async fn main() {
    let mut website: Website = Website::new("https://rsseau.fr");
    let mut rx2 = website.subscribe(16).unwrap();
    let mut g = website.subscribe_guard().unwrap();
    let q = website.queue(100).unwrap();

    tokio::spawn(async move {
        while let Ok(res) = rx2.recv().await {
            let u = res.get_url();
            println!("{:?}", u);

            let mut url = Url::parse(u).expect("Failed to parse URL");
            let mut segments: Vec<_> = url
                .path_segments()
                .map(|c| c.collect::<Vec<_>>())
                .unwrap_or_else(Vec::new);

            // pre-fetch all fr locales: queue the fr variant of each en page
            if !segments.is_empty() && segments[0] == "en" {
                segments[0] = "fr";
                let new_path = segments.join("/");
                url.set_path(&new_path);
                let _ = q.send(url.into());
            }

            g.inc();
        }
    });

    let start = std::time::Instant::now();
    website.crawl().await;
    let duration = start.elapsed();

    println!(
        "Time elapsed in website.crawl() is: {:?} for total pages: {:?}",
        duration,
        website.get_links().len()
    )
}
Thanks @oiwn
Full Changelog: v1.84.11...v1.85.4
v1.84.11
v1.84.9
What's Changed
A Chrome sitemap compile fix, and Chrome is now the default.
- chore(sitemap): fix chrome sitemap handling
- feat(chrome): add auth challenge response
- chore(smart): fix smart mode http defaults
- chore(chrom_intercept): fix page hang
Full Changelog: v1.84.3...v1.84.9
v1.84.3
What's Changed
Major performance increase for Chrome crawls/scrapes: roughly 2x faster.
- perf(chrome): add direct page navigation
Full Changelog: v1.84.1...v1.84.3
v1.84.1
What's Changed
- chore(chrome): fix network_wait_for page hang and inconsistent pages
- chore(chrome): fix concurrent page handling
- chore(chrome): fix smart mode http request default
- feat(chrome): add chrome_headless_new flag
Thanks @esemeniuc
Full Changelog: v1.83.6...v1.84.1
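A sketch of opting into the new headless mode in Cargo.toml — the chrome_headless_new feature name comes from the bullet above; the version pin and the co-enabled chrome feature are assumptions:

```toml
[dependencies]
spider = { version = "1.84", features = ["chrome", "chrome_headless_new"] }
```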
v1.83.0
What's Changed
- feat(chrome): add wait_for configuration delay, selector, and idle network
Full Changelog: v1.82.7...v1.83.0