Releases: spider-rs/spider

v1.93.43

03 May 18:35

What's Changed

Generating random real user-agents can now be done with ua_generator@0.4.1.
Spoofing HTTP headers can now be done with the spoof flag.

Use ua_generator::ua::UserAgents if you need a dynamic User-Agent randomizer together with website.with_user_agent, as sketched after the list below.

  • feat(spoof): add referrer spoofing
  • feat(spoof): add real user-agent spoofing
  • feat(chrome): add dynamic chrome connections
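
A minimal sketch of the dynamic user-agent setup, assuming ua_generator@0.4.1 is added as a direct dependency and that ua_generator::ua::spoof_ua and website.with_user_agent keep the signatures used in the crate's other examples (the target URL is a placeholder):

extern crate spider;
extern crate ua_generator;

use spider::tokio;
use spider::website::Website;

#[tokio::main]
async fn main() {
    // Pick a random real user-agent string.
    let ua = ua_generator::ua::spoof_ua();

    let mut website: Website = Website::new("https://example.com")
        // Apply the generated agent to every request.
        .with_user_agent(Some(ua))
        .build()
        .unwrap();

    website.crawl().await;
}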

Full Changelog: v1.93.23...v1.93.43

v1.93.13

23 Apr 11:08

What's Changed

Updated crate compatibility with reqwest@0.12.4 and fixed the headers flag compile for the worker.
Removed the http3 feature flag - follow the unstable instructions if needed.

The function website.get_domain was renamed to website.get_url.
The function website.get_domain_parsed was renamed to website.get_url_parsed.
A sketch of the renamed getters follows the list below.

  • chore(worker): fix headers flag compile
  • chore(crates): update async-openai@0.20.0
  • chore(openai): trim start messages content output text
  • chore(website): fix url getter function name
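
A quick sketch of the renamed getters, assuming both still return Debug-printable values (the URL is a placeholder):

extern crate spider;

use spider::website::Website;

fn main() {
    let website: Website = Website::new("https://example.com");

    // Formerly website.get_domain() and website.get_domain_parsed().
    println!("{:?}", website.get_url());
    println!("{:?}", website.get_url_parsed());
}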

Full Changelog: v1.93.3...v1.93.13

v1.93.3

14 Apr 21:20

What's Changed

You can now take screenshots per step when using OpenAI to manipulate the page.
Connecting to a proxy over a remote headless Chrome connection is now fixed.

  1. feat(openai): add screenshot js execution after effects
  2. feat(openai): add deserialization error determination
  3. chore(chrome): fix proxy server headless connecting

use spider::configuration::GPTConfigs;

let mut gpt_config: GPTConfigs = GPTConfigs::new_multi(
    "gpt-4-1106-preview",
    vec!["Search for Movies", "Extract the hrefs found."],
    3000,
);

gpt_config.screenshot = true;
gpt_config.set_extra(true);
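
Attaching the configuration follows the same builder pattern as the other examples; a sketch continuing the snippet above (the target URL and limit are placeholders):

use spider::website::Website;

let mut website: Website = Website::new("https://www.google.com")
    .with_openai(Some(gpt_config))
    .with_limit(1)
    .build()
    .unwrap();

website.crawl().await;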

Full Changelog: v1.92.0...v1.93.3

v1.92.0

13 Apr 17:30

What's Changed

Caching OpenAI responses can now be done using the 'cache_openai' flag and a builder method.

  • docs: fix broken glob url link by @emilsivervik in #179
  • feat(openai): add response caching

Example

extern crate spider;

use spider::configuration::{GPTConfigs, WaitForIdleNetwork};
use spider::moka::future::Cache;
use spider::tokio;
use spider::website::Website;
use std::time::Duration;

#[tokio::main]
async fn main() {
    let cache = Cache::builder()
        .time_to_live(Duration::from_secs(30 * 60))
        .time_to_idle(Duration::from_secs(5 * 60))
        .max_capacity(10_000)
        .build();

    let mut gpt_config: GPTConfigs = GPTConfigs::new_multi_cache(
        "gpt-4-1106-preview",
        vec![
            "Search for Movies",
            "Click on the first result movie result",
        ],
        500,
        Some(cache),
    );
    gpt_config.set_extra(true);

    let mut website: Website = Website::new("https://www.google.com")
        .with_chrome_intercept(true, true)
        .with_wait_for_idle_network(Some(WaitForIdleNetwork::new(Some(Duration::from_secs(30)))))
        .with_limit(1)
        .with_openai(Some(gpt_config))
        .build()
        .unwrap();
    let mut rx2 = website.subscribe(16).unwrap();

    let handle = tokio::spawn(async move {
        while let Ok(page) = rx2.recv().await {
            println!("---\n{}\n{:?}\n{:?}\n---", page.get_url(), page.openai_credits_used, page.extra_ai_data);
        }
    });

    let start = crate::tokio::time::Instant::now();
    website.crawl().await;
    let duration = start.elapsed();
    let links = website.get_links();

    println!(
        "(0) Time elapsed in website.crawl() is: {:?} for total pages: {:?}",
        duration,
        links.len()
    );

    // crawl the page again to see if cache is re-used.
    let start = crate::tokio::time::Instant::now();
    website.crawl().await;
    let duration = start.elapsed();

    website.unsubscribe();

    let _ = handle.await;

    println!(
        "(1) Time elapsed in website.crawl() is: {:?} for total pages: {:?}",
        duration,
        links.len()
    );
}

New Contributors

  • @emilsivervik made their first contribution in #179

Full Changelog: v.1.91.1...v1.92.0

v1.90.0

29 Mar 01:01

What's Changed

RSS feeds are now handled automatically on crawls; a sketch follows the change list below.

  1. feat(rss): add rss support
  2. chore(openai): fix compile chrome flag
  3. chore(crate): remove serde pin
  4. chore(website): fix sitemap chrome build
  5. chore(crate): remove pins on common crates (reduces build size)
  6. chore(openai): fix prompt deserialization
  7. chore(openai): add custom api key config
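
A minimal sketch of a crawl that starts from a feed, assuming RSS entries discovered during the crawl are followed like regular links (the feed URL is a placeholder):

extern crate spider;

use spider::tokio;
use spider::website::Website;

#[tokio::main]
async fn main() {
    // Feed entries found during the crawl are followed like regular links.
    let mut website: Website = Website::new("https://example.com/rss.xml");

    website.crawl().await;

    println!("total pages: {}", website.get_links().len());
}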

Full Changelog: v1.89.0...v1.90.0

v.1.91.1

10 Apr 09:52

What's Changed

The AI results now return the input (prompt), js_output, and content_output.
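
A hedged sketch of reading those fields from a subscribed page, assuming extra_ai_data is an Option-wrapped list of result records as printed in the v1.92.0 example above:

// Inside a subscribe loop (full setup shown in the v1.92.0 example above).
while let Ok(page) = rx2.recv().await {
    if let Some(results) = page.extra_ai_data {
        for res in results {
            // Each record carries the input (prompt), js_output, and content_output.
            println!("{:?}", res);
        }
    }
}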

Full Changelog: v1.90.0...v.1.91.1

v1.88.7

26 Mar 16:50

What's Changed

You can now drive the browser through multiple steps toward a goal, as in the example below. Extracting content or gathering extra data can also be done using GPTConfigs.extra_ai_data.

The credits used can be checked with Page.openai_credits_used.

  1. chore(page): return all page content regardless of status
  2. chore(openai): fix svg removal
  3. feat(openai): add extra data gpt curating
  4. chore(openai): add credits used response
  5. feat(fingerprint): add fingerprint id configuration

extern crate spider;

use spider::configuration::GPTConfigs;
use spider::tokio;
use spider::website::Website;

#[tokio::main]
async fn main() {
    let gpt_config: GPTConfigs = GPTConfigs::new_multi(
        "gpt-4-1106-preview",
        vec![
            "Search for Movies",
            "Click on the first result movie result",
        ],
        500,
    );

    let mut website: Website = Website::new("https://www.google.com")
        .with_openai(Some(gpt_config))
        .with_limit(1)
        .build()
        .unwrap();

    website.crawl().await;
}
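
To check Page.openai_credits_used and the curated data per page, subscribe before crawling as in the v1.92.0 example above; a sketch continuing the program above:

// Add before website.crawl().await:
let mut rx2 = website.subscribe(16).unwrap();

tokio::spawn(async move {
    while let Ok(page) = rx2.recv().await {
        // Credits consumed and extra AI data gathered for this page.
        println!("{:?} {:?}", page.openai_credits_used, page.extra_ai_data);
    }
});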

Image: Google search with the AI clicking the first result based on the prompts.

Full Changelog: v1.87.3...v1.88.7

v1.87.3

25 Mar 18:44

What's Changed

You can now bypass Cloudflare-protected pages with the real_browser feature flag (a Cargo sketch follows the list below).

  • feat(real_browser): add real_browser feature flag for chrome
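
The flag is enabled through Cargo features; a sketch of the dependency line, assuming it is combined with the chrome feature:

[dependencies]
spider = { version = "1.87.3", features = ["chrome", "real_browser"] }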

Full Changelog: v1.86.16...v1.87.3

v1.86.16

19 Mar 17:04

What's Changed

You can now dynamically drive the browser with custom scripts using OpenAI.
Make sure to set the OPENAI_API_KEY environment variable or pass the key to the program.

  • Openai/chrome driver by @j-mendez in #174
  • chore(page): add cold fusion file crawling support

extern crate spider;

use spider::configuration::{GPTConfigs, WaitForIdleNetwork};
use spider::tokio;
use spider::website::Website;
use std::time::Duration;

#[tokio::main]
async fn main() {
    let _ = tokio::fs::create_dir_all("./storage/").await;

    let screenshot_params =
        spider::configuration::ScreenshotParams::new(Default::default(), Some(true), Some(true));
    let screenshot_config =
        spider::configuration::ScreenShotConfig::new(screenshot_params, true, true, None);

    let mut website: Website = Website::new("https://google.com")
        .with_chrome_intercept(true, true)
        .with_wait_for_idle_network(Some(WaitForIdleNetwork::new(Some(Duration::from_secs(30)))))
        .with_screenshot(Some(screenshot_config))
        .with_limit(1)
        .with_openai(Some(GPTConfigs::new(
            "gpt-4-1106-preview",
            "Search for Movies",
            500,
        )))
        .build()
        .unwrap();
    let mut rx2 = website.subscribe(16).unwrap();

    tokio::spawn(async move {
        while let Ok(page) = rx2.recv().await {
            println!("{}\n{}", page.get_url(), page.get_html());
        }
    });

    website.crawl().await;
}

The output of the custom script from the AI:

[Image: custom script output]

The screenshot of the page output:

[Image: page output]

Full Changelog: v1.85.4...v1.86.16

v1.85.4

11 Mar 18:50

What's Changed

You can now update the crawl links from outside of the crawl context by using website.queue to get a sender.

  • feat(q): add mid crawl queue
  • chore(chrome): fix semaphore limiting scrape

use spider::tokio;
use spider::url::Url;
use spider::website::Website;

#[tokio::main]
async fn main() {
    let mut website: Website = Website::new("https://rsseau.fr");
    let mut rx2 = website.subscribe(16).unwrap();
    let mut g = website.subscribe_guard().unwrap();
    let q = website.queue(100).unwrap();

    tokio::spawn(async move {
        while let Ok(res) = rx2.recv().await {
            let u = res.get_url();
            println!("{:?}", u);
            let mut url = Url::parse(u).expect("Failed to parse URL");

            let mut segments: Vec<_> = url
                .path_segments()
                .map(|c| c.collect::<Vec<_>>())
                .unwrap_or_else(Vec::new);

            if segments.len() > 0 && segments[0] == "en" {
                segments[0] = "fr";
                let new_path = segments.join("/");
                url.set_path(&new_path);
                // get a new url here or perform an action and queue links
                // pre-fetch all fr locales
                let _ = q.send(url.into());
            }
            g.inc();
        }
    });

    let start = std::time::Instant::now();
    website.crawl().await;
    let duration = start.elapsed();

    println!(
        "Time elapsed in website.crawl() is: {:?} for total pages: {:?}",
        duration,
        website.get_links().len()
    )
}

Thanks @oiwn

Full Changelog: v1.84.11...v1.85.4