Releases: spider-rs/spider
v1.91.1
What's Changed
The AI results now return the input (prompt), js_output, and content_output.
Full Changelog: v1.90.0...v1.91.1
v1.88.7
What's Changed
You can now drive the browser with multiple steps toward a goal (see the example below). Extracting content or gathering extra data can be done as well using GPTConfigs.extra_ai_data.
The credits used can be checked with Page.openai_credits_used.
- chore(page): return all page content regardless of status
- chore(openai): fix svg removal
- feat(openai): add extra data gpt curating
- chore(openai): add credits used response
- feat(fingerprint): add fingerprint id configuration
use spider::configuration::GPTConfigs;
use spider::tokio;
use spider::website::Website;

#[tokio::main]
async fn main() {
    let gpt_config: GPTConfigs = GPTConfigs::new_multi(
        "gpt-4-1106-preview",
        vec![
            "Search for Movies",
            "Click on the first result movie result",
        ],
        500,
    );

    let mut website: Website = Website::new("https://www.google.com")
        .with_openai(Some(gpt_config))
        .with_limit(1)
        .build()
        .unwrap();

    website.crawl().await;
}
Full Changelog: v1.87.3...v1.88.7
v1.87.3
What's Changed
You can now bypass Cloudflare-protected pages with the feature flag real_browser.
- feat(real_browser): add real_browser feature flag for chrome
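A sketch of enabling the flag in Cargo.toml — the feature name comes from the note above; the version pin and the co-enabled chrome feature are assumptions:

```toml
[dependencies]
spider = { version = "1.87", features = ["chrome", "real_browser"] }
```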
Full Changelog: v1.86.16...v1.87.3
v1.86.16
What's Changed
You can now dynamically drive the browser with custom scripts using OpenAI.
Make sure to set the OPENAI_API_KEY env variable or pass it to the program.
extern crate spider;

use std::time::Duration;

use spider::configuration::{GPTConfigs, WaitForIdleNetwork};
use spider::tokio;
use spider::website::Website;

#[tokio::main]
async fn main() {
    let _ = tokio::fs::create_dir_all("./storage/").await;

    let screenshot_params =
        spider::configuration::ScreenshotParams::new(Default::default(), Some(true), Some(true));
    let screenshot_config =
        spider::configuration::ScreenShotConfig::new(screenshot_params, true, true, None);

    let mut website: Website = Website::new("https://google.com")
        .with_chrome_intercept(true, true)
        .with_wait_for_idle_network(Some(WaitForIdleNetwork::new(Some(Duration::from_secs(30)))))
        .with_screenshot(Some(screenshot_config))
        .with_limit(1)
        .with_openai(Some(GPTConfigs::new(
            "gpt-4-1106-preview",
            "Search for Movies",
            500,
        )))
        .build()
        .unwrap();

    let mut rx2 = website.subscribe(16).unwrap();

    tokio::spawn(async move {
        while let Ok(page) = rx2.recv().await {
            println!("{}\n{}", page.get_url(), page.get_html());
        }
    });

    website.crawl().await;
}
The output of the custom script from the AI:
The screenshot of the page output:
Full Changelog: v1.85.4...v1.86.16
v1.85.4
What's Changed
You can now update the crawl links from outside the crawl context by using website.queue to get a sender.
- feat(q): add mid crawl queue
- chore(chrome): fix semaphore limiting scrape
use spider::tokio;
use spider::url::Url;
use spider::website::Website;

#[tokio::main]
async fn main() {
    let mut website: Website = Website::new("https://rsseau.fr");
    let mut rx2 = website.subscribe(16).unwrap();
    let mut g = website.subscribe_guard().unwrap();
    let q = website.queue(100).unwrap();

    tokio::spawn(async move {
        while let Ok(res) = rx2.recv().await {
            let u = res.get_url();
            println!("{:?}", u);

            let mut url = Url::parse(u).expect("Failed to parse URL");
            let mut segments: Vec<_> = url
                .path_segments()
                .map(|c| c.collect::<Vec<_>>())
                .unwrap_or_else(Vec::new);

            // pre-fetch all fr locales: queue the fr variant of each en page
            if !segments.is_empty() && segments[0] == "en" {
                segments[0] = "fr";
                let new_path = segments.join("/");
                url.set_path(&new_path);
                let _ = q.send(url.into());
            }

            g.inc();
        }
    });

    let start = std::time::Instant::now();
    website.crawl().await;
    let duration = start.elapsed();

    println!(
        "Time elapsed in website.crawl() is: {:?} for total pages: {:?}",
        duration,
        website.get_links().len()
    )
}
Thanks @oiwn
Full Changelog: v1.84.11...v1.85.4
v1.84.11
v1.84.9
What's Changed
A Chrome sitemap compile fix, and Chrome is now the default.
- chore(sitemap): fix chrome sitemap handling
- feat(chrome): add auth challenge response
- chore(smart): fix smart mode http defaults
- chore(chrom_intercept): fix page hang
Full Changelog: v1.84.3...v1.84.9
v1.84.3
What's Changed
Major performance increase for Chrome crawls/scrapes: roughly 2x faster.
- perf(chrome): add direct page navigation
Full Changelog: v1.84.1...v1.84.3
v1.84.1
What's Changed
- chore(chrome): fix network_wait_for page hang and inconsistent pages
- chore(chrome): fix concurrent page handling
- chore(chrome): fix smart mode http request default
- feat(chrome): add chrome_headless_new flag
Thanks @esemeniuc
Full Changelog: v1.83.6...v1.84.1
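A sketch of opting into the new headless mode in Cargo.toml — the chrome_headless_new feature name comes from the bullet above; the version pin and the co-enabled chrome feature are assumptions:

```toml
[dependencies]
spider = { version = "1.84", features = ["chrome", "chrome_headless_new"] }
```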
v1.83.0
What's Changed
- feat(chrome): add wait_for configuration delay, selector, and idle network
Full Changelog: v1.82.7...v1.83.0