Releases: spider-rs/spider
v1.93.43
What's Changed
Generating random real user-agents can now be done using ua_generator@0.4.1. Spoofing HTTP headers can now be done with the spoof flag. Use ua_generator::ua::UserAgents if you need a dynamic User-Agent randomizer together with website.with_user_agent; a minimal sketch follows the list below.
- feat(spoof): add referrer spoofing
- feat(spoof): add real user-agent spoofing
- feat(chrome): add dynamic chrome connections
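A minimal sketch applying a randomized real user-agent to a crawl, assuming ua_generator@0.4.1 is added as a direct dependency, that ua_generator::ua::spoof_ua() returns a static string, and that with_user_agent accepts an Option<&str> (UserAgents can be swapped in for per-call randomization):
extern crate spider;
extern crate ua_generator;

use spider::tokio;
use spider::website::Website;

#[tokio::main]
async fn main() {
    // Pick one random real user-agent for the whole crawl; the notes point to
    // ua_generator::ua::UserAgents for a dynamic randomizer instead.
    let ua = ua_generator::ua::spoof_ua();

    // Placeholder URL for this sketch.
    let mut website: Website = Website::new("https://example.com");
    website.with_user_agent(Some(ua));

    website.crawl().await;
}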
Full Changelog: v1.93.23...v1.93.43
v1.93.13
What's Changed
Updated crate compatibility with reqwest@0.12.4 and fixed the headers compile for the worker. The http3 feature flag was removed - follow the unstable instructions if needed. The function website.get_domain was renamed to website.get_url, and website.get_domain_parsed was renamed to website.get_url_parsed; a small usage sketch follows the list below.
- chore(worker): fix headers flag compile
- chore(crates): update async-openai@0.20.0
- chore(openai): trim start messages content output text
- chore(website): fix url getter function name
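A small usage sketch of the renamed getters (the example URL and the Debug formatting of the returned values are assumptions here):
extern crate spider;
use spider::website::Website;

fn main() {
    // Placeholder URL for this sketch.
    let website: Website = Website::new("https://example.com");

    // Previously website.get_domain() and website.get_domain_parsed().
    println!("{:?}", website.get_url());
    println!("{:?}", website.get_url_parsed());
}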
Full Changelog: v1.93.3...v1.93.13
v1.93.3
What's Changed
You can now take screenshots per step when using OpenAI to manipulate the page.
Connecting to a proxy when using remote headless Chrome is now fixed.
- feat(openai): add screenshot js execution after effects
- feat(openai): add deserialization error determination
- chore(chrome): fix proxy server headless connecting
use spider::configuration::GPTConfigs;

let mut gpt_config: GPTConfigs = GPTConfigs::new_multi(
    "gpt-4-1106-preview",
    vec!["Search for Movies", "Extract the hrefs found."],
    3000,
);

// Take a screenshot after each OpenAI-driven step and return the extra AI data with the page.
gpt_config.screenshot = true;
gpt_config.set_extra(true);
Full Changelog: v1.92.0...v1.93.3
v1.92.0
What's Changed
Caching OpenAI responses can now be done using the 'cache_openai' flag and a builder method.
- docs: fix broken glob url link by @emilsivervik in #179
- feat(openai): add response caching
Example
extern crate spider;
use spider::configuration::{GPTConfigs, WaitForIdleNetwork};
use spider::moka::future::Cache;
use spider::tokio;
use spider::website::Website;
use std::time::Duration;
#[tokio::main]
async fn main() {
let cache = Cache::builder()
.time_to_live(Duration::from_secs(30 * 60))
.time_to_idle(Duration::from_secs(5 * 60))
.max_capacity(10_000)
.build();
let mut gpt_config: GPTConfigs = GPTConfigs::new_multi_cache(
"gpt-4-1106-preview",
vec![
"Search for Movies",
"Click on the first result movie result",
],
500,
Some(cache),
);
gpt_config.set_extra(true);
let mut website: Website = Website::new("https://www.google.com")
.with_chrome_intercept(true, true)
.with_wait_for_idle_network(Some(WaitForIdleNetwork::new(Some(Duration::from_secs(30)))))
.with_limit(1)
.with_openai(Some(gpt_config))
.build()
.unwrap();
let mut rx2 = website.subscribe(16).unwrap();
let handle = tokio::spawn(async move {
while let Ok(page) = rx2.recv().await {
println!("---\n{}\n{:?}\n{:?}\n---", page.get_url(), page.openai_credits_used, page.extra_ai_data);
}
});
let start = crate::tokio::time::Instant::now();
website.crawl().await;
let duration = start.elapsed();
let links = website.get_links();
println!(
"(0) Time elapsed in website.crawl() is: {:?} for total pages: {:?}",
duration,
links.len()
);
// crawl the page again to see if cache is re-used.
let start = crate::tokio::time::Instant::now();
website.crawl().await;
let duration = start.elapsed();
website.unsubscribe();
let _ = handle.await;
println!(
"(1) Time elapsed in website.crawl() is: {:?} for total pages: {:?}",
duration,
links.len()
);
}
New Contributors
- @emilsivervik made their first contribution in #179
Full Changelog: v.1.91.1...v1.92.0
v1.90.0
What's Changed
RSS feeds are now handled automatically on crawls; a short sketch follows the list below.
- feat(rss): add rss support
- chore(openai): fix compile chrome flag
- chore(crate): remove serde pin
- chore(website): fix sitemap chrome build
- chore(crate): remove pins on common crates (reduces build size)
- chore(openai): fix prompt deserialization
- chore(openai): add custom api key config
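A minimal sketch, assuming the target exposes an RSS feed at the placeholder URL below; with RSS support the feed entries are crawled like regular links:
extern crate spider;

use spider::tokio;
use spider::website::Website;

#[tokio::main]
async fn main() {
    // Placeholder feed URL for this sketch.
    let mut website: Website = Website::new("https://example.com/feed.xml");

    website.crawl().await;

    for link in website.get_links() {
        println!("{:?}", link);
    }
}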
Full Changelog: v1.89.0...v1.90.0
v.1.91.1
What's Changed
The AI results now return the input (prompt), js_output, and content_output; a hedged sketch of reading them follows.
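A hedged sketch of reading those fields off subscribed pages; the Option/Vec shape of extra_ai_data and the exact field types are assumptions here:
extern crate spider;

use spider::configuration::GPTConfigs;
use spider::tokio;
use spider::website::Website;

#[tokio::main]
async fn main() {
    let mut gpt_config: GPTConfigs =
        GPTConfigs::new("gpt-4-1106-preview", "Search for Movies", 500);
    // Return the extra AI data alongside each page.
    gpt_config.set_extra(true);

    let mut website: Website = Website::new("https://www.google.com")
        .with_openai(Some(gpt_config))
        .with_limit(1)
        .build()
        .unwrap();

    let mut rx2 = website.subscribe(16).unwrap();

    let handle = tokio::spawn(async move {
        while let Ok(page) = rx2.recv().await {
            // Assumption: extra_ai_data is an Option over a list of results
            // exposing the fields named in this release.
            if let Some(results) = page.extra_ai_data.as_ref() {
                for r in results {
                    println!("input: {:?}", r.input);
                    println!("js_output: {:?}", r.js_output);
                    println!("content_output: {:?}", r.content_output);
                }
            }
        }
    });

    website.crawl().await;
    website.unsubscribe();
    let _ = handle.await;
}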
Full Changelog: v1.90.0...v.1.91.1
v1.88.7
What's Changed
You can now drive the browser with multiple steps toward a goal (see the example below). Extracting content or gathering extra data can be done as well using GPTConfigs.extra_ai_data. The credits used can be checked with Page.openai_credits_used.
- chore(page): return all page content regardless of status
- chore(openai): fix svg removal
- feat(openai): add extra data gpt curating
- chore(openai): add credits used response
- feat(fingerprint): add fingerprint id configuration
extern crate spider;

use spider::configuration::{GPTConfigs, WaitForIdleNetwork};
use spider::tokio;
use spider::website::Website;
#[tokio::main]
async fn main() {
let gpt_config: GPTConfigs = GPTConfigs::new_multi(
"gpt-4-1106-preview",
vec![
"Search for Movies",
"Click on the first result movie result",
],
500,
);
let mut website: Website = Website::new("https://www.google.com")
.with_openai(Some(gpt_config))
.with_limit(1)
.build()
.unwrap();
website.crawl().await;
}
Full Changelog: v1.87.3...v1.88.7
v1.87.3
What's Changed
You can now bypass Cloudflare-protected pages with the real_browser feature flag.
- feat(real_browser): add real_browser feature flag for chrome
Full Changelog: v1.86.16...v1.87.3
v1.86.16
What's Changed
You can now dynamically drive the browser with custom scripts using OpenAI. Make sure to set the OPENAI_API_KEY env variable or pass it to the program.
extern crate spider;

use spider::configuration::{GPTConfigs, WaitForIdleNetwork};
use spider::tokio;
use spider::website::Website;
use std::time::Duration;
#[tokio::main]
async fn main() {
let _ = tokio::fs::create_dir_all("./storage/").await;
let screenshot_params =
spider::configuration::ScreenshotParams::new(Default::default(), Some(true), Some(true));
let screenshot_config =
spider::configuration::ScreenShotConfig::new(screenshot_params, true, true, None);
let mut website: Website = Website::new("https://google.com")
.with_chrome_intercept(true, true)
.with_wait_for_idle_network(Some(WaitForIdleNetwork::new(Some(Duration::from_secs(30)))))
.with_screenshot(Some(screenshot_config))
.with_limit(1)
.with_openai(Some(GPTConfigs::new(
"gpt-4-1106-preview",
"Search for Movies",
500,
)))
.build()
.unwrap();
let mut rx2 = website.subscribe(16).unwrap();
tokio::spawn(async move {
while let Ok(page) = rx2.recv().await {
println!("{}\n{}", page.get_url(), page.get_html());
}
});
website.crawl().await;
}
Full Changelog: v1.85.4...v1.86.16
v1.85.4
What's Changed
You can now update the crawl links from outside of the crawl context by using website.queue to get a sender.
- feat(q): add mid crawl queue
- chore(chrome): fix semaphore limiting scrape
use spider::tokio;
use spider::url::Url;
use spider::website::Website;
#[tokio::main]
async fn main() {
let mut website: Website = Website::new("https://rsseau.fr");
let mut rx2 = website.subscribe(16).unwrap();
let mut g = website.subscribe_guard().unwrap();
let q = website.queue(100).unwrap();
tokio::spawn(async move {
while let Ok(res) = rx2.recv().await {
let u = res.get_url();
println!("{:?}", u);
let mut url = Url::parse(u).expect("Failed to parse URL");
let mut segments: Vec<_> = url
.path_segments()
.map(|c| c.collect::<Vec<_>>())
.unwrap_or_else(Vec::new);
if !segments.is_empty() && segments[0] == "en" {
segments[0] = "fr";
let new_path = segments.join("/");
url.set_path(&new_path);
// get a new url here or perform an action and queue links
// pre-fetch all fr locales
let _ = q.send(url.into());
}
g.inc();
}
});
let start = std::time::Instant::now();
website.crawl().await;
let duration = start.elapsed();
println!(
"Time elapsed in website.crawl() is: {:?} for total pages: {:?}",
duration,
website.get_links().len()
)
}
Thanks @oiwn
Full Changelog: v1.84.11...v1.85.4