
Releases: spider-rs/spider

v.1.91.1

10 Apr 09:52

What's Changed

The AI results now return the input (prompt), js_output, and content_output.
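As a purely illustrative sketch, the shape of that payload might mirror the struct below (field names are taken from the note above; the types are assumptions, not the crate's actual definitions):

```rust
// Hypothetical mirror of the AI result payload; types are assumed.
#[derive(Debug)]
struct AIResults {
    input: String,               // the prompt that was sent
    js_output: String,           // JS the model produced/executed
    content_output: Vec<String>, // extra content the model extracted
}

fn sample() -> AIResults {
    AIResults {
        input: "Search for Movies".into(),
        js_output: String::new(),
        content_output: Vec::new(),
    }
}

fn main() {
    println!("prompt used: {}", sample().input);
}
```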

Full Changelog: v1.90.0...v.1.91.1

v1.88.7

26 Mar 16:50

What's Changed

You can now drive the browser with multiple steps toward a goal, as in the example below. Extracting content or gathering extra data can be done as well using GPTConfigs.extra_ai_data.

The credits used can be checked with Page.openai_credits_used.

  1. chore(page): return all page content regardless of status
  2. chore(openai): fix svg removal
  3. feat(openai): add extra data gpt curating
  4. chore(openai): add credits used response
  5. feat(fingerprint): add fingerprint id configuration
extern crate spider;
use spider::configuration::GPTConfigs;
use spider::tokio;
use spider::website::Website;

#[tokio::main]
async fn main() {
    let gpt_config: GPTConfigs = GPTConfigs::new_multi(
        "gpt-4-1106-preview",
        vec![
            "Search for Movies",
            "Click on the first movie result",
        ],
        500,
    );

    let mut website: Website = Website::new("https://www.google.com")
        .with_openai(Some(gpt_config))
        .with_limit(1)
        .build()
        .unwrap();

    website.crawl().await;
}

Image: Google search being driven by the AI prompts, clicking on the first search result.

Full Changelog: v1.87.3...v1.88.7

v1.87.3

25 Mar 18:44

What's Changed

You can now bypass Cloudflare-protected pages with the feature flag [real_browser].
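Since this ships as a Cargo feature, enabling it is a dependency-line change; a sketch of the manifest entry (the version number here is illustrative):

```toml
[dependencies]
spider = { version = "1.87", features = ["chrome", "real_browser"] }
```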

  • feat(real_browser): add real_browser feature flag for chrome

Full Changelog: v1.86.16...v1.87.3

v1.86.16

19 Mar 17:04

What's Changed

You can now dynamically drive the browser with custom scripts using OpenAI.
Make sure to set the OPENAI_API_KEY env variable or pass it to the program.
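For example, setting the variable in the shell before running the program (the key below is a placeholder):

```shell
# Placeholder value; substitute your real OpenAI API key.
export OPENAI_API_KEY="sk-your-key-here"
```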

  • Openai/chrome driver by @j-mendez in #174
  • chore(page): add cold fusion file crawling support
extern crate spider;
use spider::configuration::{GPTConfigs, WaitForIdleNetwork};
use spider::tokio;
use spider::website::Website;
use std::time::Duration;

#[tokio::main]
async fn main() {
    let _ = tokio::fs::create_dir_all("./storage/").await;

    let screenshot_params =
        spider::configuration::ScreenshotParams::new(Default::default(), Some(true), Some(true));
    let screenshot_config =
        spider::configuration::ScreenShotConfig::new(screenshot_params, true, true, None);

    let mut website: Website = Website::new("https://google.com")
        .with_chrome_intercept(true, true)
        .with_wait_for_idle_network(Some(WaitForIdleNetwork::new(Some(Duration::from_secs(30)))))
        .with_screenshot(Some(screenshot_config))
        .with_limit(1)
        .with_openai(Some(GPTConfigs::new(
            "gpt-4-1106-preview",
            "Search for Movies",
            500,
        )))
        .build()
        .unwrap();
    let mut rx2 = website.subscribe(16).unwrap();

    tokio::spawn(async move {
        while let Ok(page) = rx2.recv().await {
            println!("{}\n{}", page.get_url(), page.get_html());
        }
    });

    website.crawl().await;
}

The output of the custom script from the AI:

Image: custom script output.

The screenshot of the page:

Image: screenshot of the page output.

Full Changelog: v1.85.4...v1.86.16

v1.85.4

11 Mar 18:50

What's Changed

You can now update the crawl links from outside the crawl context by using website.queue to get a sender.

  • feat(q): add mid crawl queue
  • chore(chrome): fix semaphore limiting scrape
use spider::tokio;
use spider::url::Url;
use spider::website::Website;

#[tokio::main]
async fn main() {
    let mut website: Website = Website::new("https://rsseau.fr");
    let mut rx2 = website.subscribe(16).unwrap();
    let mut g = website.subscribe_guard().unwrap();
    let q = website.queue(100).unwrap();

    tokio::spawn(async move {
        while let Ok(res) = rx2.recv().await {
            let u = res.get_url();
            println!("{:?}", u);
            let mut url = Url::parse(u).expect("Failed to parse URL");

            let mut segments: Vec<_> = url
                .path_segments()
                .map(|c| c.collect::<Vec<_>>())
                .unwrap_or_else(Vec::new);

            if !segments.is_empty() && segments[0] == "en" {
                segments[0] = "fr";
                let new_path = segments.join("/");
                url.set_path(&new_path);
                // get a new url here or perform an action and queue links
                // pre-fetch all fr locales
                let _ = q.send(url.into());
            }
            g.inc();
        }
    });

    let start = std::time::Instant::now();
    website.crawl().await;
    let duration = start.elapsed();

    println!(
        "Time elapsed in website.crawl() is: {:?} for total pages: {:?}",
        duration,
        website.get_links().len()
    )
}
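The subscriber above swaps the leading en path segment for fr before queueing the URL; the segment swap itself can be sketched with std-only string handling (illustrative helper, not part of the crate):

```rust
/// Swap a leading "/en/..." path prefix for "/fr/...".
/// Returns None when the path is not under the "en" locale.
fn rewrite_locale(path: &str) -> Option<String> {
    let mut segments: Vec<&str> = path.trim_start_matches('/').split('/').collect();
    if segments.first() == Some(&"en") {
        segments[0] = "fr";
        Some(format!("/{}", segments.join("/")))
    } else {
        None
    }
}

fn main() {
    println!("{:?}", rewrite_locale("/en/blog/post")); // the fr-locale twin
    println!("{:?}", rewrite_locale("/about"));        // not locale-prefixed
}
```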

Thanks @oiwn

Full Changelog: v1.84.11...v1.85.4

v1.84.11

09 Mar 15:20

What's Changed

You can now pre-set links to crawl, or extend an existing crawl, using website.set_extra_links.

  • chore(website): add set extra links extended crawls
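Conceptually, set_extra_links extends the crawl's de-duplicated set of links; a std-only sketch of that behavior (not the crate's actual internals):

```rust
use std::collections::HashSet;

/// Seed a link set and extend it with extra links; duplicates are ignored.
fn seeded_links() -> HashSet<String> {
    // Links already discovered by the crawl.
    let mut links: HashSet<String> =
        HashSet::from(["https://example.com/a".to_string()]);

    // Extra links pre-set before the crawl; "/a" is already present.
    let extra = ["https://example.com/a", "https://example.com/b"];
    links.extend(extra.iter().map(|s| s.to_string()));

    links
}

fn main() {
    println!("total unique links: {}", seeded_links().len());
}
```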

@oiwn thanks for the help!

Full Changelog: v1.84.9...v1.84.11

v1.84.9

26 Feb 21:51

What's Changed

This release fixes the Chrome sitemap compile error and defaults to Chrome.

  1. chore(sitemap): fix chrome sitemap handling
  2. feat(chrome): add auth challenge response
  3. chore(smart): fix smart mode http defaults
  4. chore(chrom_intercept): fix page hang

Full Changelog: v1.84.3...v1.84.9

v1.84.3

25 Feb 12:58

What's Changed

Major performance increase for Chrome crawls/scrapes: roughly 2x faster.

  1. perf(chrome): add direct page navigation

Full Changelog: v1.84.1...v1.84.3

v1.84.1

21 Feb 11:46

What's Changed

  1. chore(chrome): fix network_wait_for page hang and inconsistent pages
  2. chore(chrome): fix concurrent page handling
  3. chore(chrome): fix smart mode http request default
  4. feat(chrome): add chrome_headless_new flag

Thanks @esemeniuc

Full Changelog: v1.83.6...v1.84.1

v1.83.0

21 Feb 01:51

What's Changed

  1. feat(chrome): add wait_for configuration delay, selector, and idle network

Full Changelog: v1.82.7...v1.83.0