
Releases: spider-rs/spider

v2.2.18

29 Aug 02:03

What's Changed

We can now auto-detect locales without sacrificing performance. The encoding feature flag is now enabled by default to support this change!

  • get_html now properly encodes the HTML instead of defaulting to UTF-8
  • bump chromiumoxide@0.7.0
  • fix chrome hang in the ws connections handler
  • fix fetch stream infinite loop on error
  • fix chrome frame URL setting (this temporarily prevents hybrid caching from storing the req/res for the page)

use spider::website::Website;

let mut website: Website = Website::new("https://tenki.jp");
// all of the content output has the proper encoding automatically

Full Changelog: v2.1.9...v2.2.18

v2.1.9

26 Aug 17:07

What's New

This release fixes bugs where Chrome hung when opening pages. The builder method website.with_return_page_links can be used to attach the links found on a web page to the page object (see the sketch after the change list).

  • chore(chrome): fix instances being left open from ignorable handler errors
  • chore(scrape): add sitemap and smart mode support [#206]
  • feat(page): add return page links configuration
  • chore(config): fix budget reset on crawl end
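
A minimal sketch of the new builder method; the bool flag, the example URL, and reading the links back via get_pages and page_links are assumptions based on the description above:

use spider::website::Website;

#[tokio::main]
async fn main() {
    // attach the links discovered on each page to the page object
    let mut website: Website = Website::new("https://example.com")
        .with_return_page_links(true)
        .build()
        .unwrap();

    website.scrape().await;

    // hypothetical read-back: page_links is assumed to hold the attached links
    if let Some(pages) = website.get_pages() {
        for page in pages.iter() {
            println!("{} -> {:?}", page.get_url(), page.page_links);
        }
    }
}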

Thanks @DimitriTimoz

Full Changelog: v2.0.6...v2.1.9

v2.0.6

20 Aug 20:51

What's Changed

  • add http response cookies map
  • fix chrome fs feature flag build
  • Update README.md by @James4Ever0 in #203

New Contributors

  • @James4Ever0 made their first contribution in #203

Full Changelog: v2.0.3...v2.0.6

v2.0.3

14 Aug 11:49

What's Changed

  1. Scrape and crawl now behave identically, since scrape re-uses crawl underneath (see the sketch after this list).
  2. Scrape API cleanup
  3. Add get_chrome_page to get a reference to the Chrome page
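
A short sketch of the unified behavior; the URL is a placeholder, and the get_pages read-back assumes scrape still stores page contents:

use spider::website::Website;

#[tokio::main]
async fn main() {
    let mut website: Website = Website::new("https://example.com");

    // scrape drives the same engine as crawl, but also stores page contents
    website.scrape().await;

    if let Some(pages) = website.get_pages() {
        println!("scraped {} pages", pages.len());
    }
}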

Full Changelog: v1.99.30...v2.0.3

v1.99.30

07 Aug 20:33

What's Changed

  • feat: web automation steps by target URL or path
  • add internal ViewPort for chrome handling.
  • add PartialEq to the configuration

// imports are assumptions: the chrome feature is enabled and exact paths may vary by version
use spider::configuration::{WaitForIdleNetwork, WebAutomation};
use spider::hashbrown::HashMap;
use spider::website::Website;
use std::time::Duration;

let mut automation_scripts = HashMap::new();

automation_scripts.insert(
    "/en/blog".into(),
    Vec::from([
        WebAutomation::Evaluate(r#"document.body.style.background = "blue";"#.into()),
        WebAutomation::ScrollY(2000),
        WebAutomation::Click("article a".into()),
        WebAutomation::Wait(5000),
        WebAutomation::Screenshot {
            output: "example.png".into(),
            full_page: true,
            omit_background: true,
        },
    ]),
);

let mut website: Website = Website::new("https://rsseau.fr/en/blog")
    .with_chrome_intercept(true, true)
    .with_wait_for_idle_network(Some(WaitForIdleNetwork::new(Some(Duration::from_secs(30)))))
    .with_caching(cfg!(feature = "cache"))
    .with_limit(1)
    .with_automation_scripts(Some(automation_scripts))
    .build()
    .unwrap();
(Demo video: web-automation-chrome.mov)

Full Changelog: v1.99.21...v1.99.30

v1.99.21

07 Aug 15:40

What's Changed

You can now block ads over the network when using chrome and chrome_intercept by enabling the adblock feature flag.
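
A hedged sketch, assuming the adblock feature is enabled alongside chrome and chrome_intercept in Cargo.toml; the URL is a placeholder and the intercept flags mirror the v1.99.30 example:

use spider::website::Website;

#[tokio::main]
async fn main() {
    // with the adblock feature compiled in, intercepted ad requests are
    // blocked at the network layer before they reach the page
    let mut website: Website = Website::new("https://example.com")
        .with_chrome_intercept(true, true)
        .build()
        .unwrap();

    website.crawl().await;
}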

Full Changelog: v1.99.18...v1.99.21

v1.99.18

05 Aug 19:19

What's Changed

  1. chore(fs,chrome): fix chrome fs storing [#198]

Thanks for the help @haijd

Full Changelog: v1.99.16...v1.99.18

v1.99.16

01 Aug 13:33

What's Changed

  • Fixing clap issues #195 by @jmikedupont2 in #196
  • Fix chrome fingerprint and initial document scripts setup
  • Perf improvements for smart mode asset handling via a compile-time constant map

New Contributors

  • @jmikedupont2 made their first contribution in #196

Full Changelog: v1.99.10...v1.99.16

v1.99.10

19 Jul 19:36

What's Changed

  1. Major performance increase for CLI scraping and downloading. We now use subscriptions and concurrently stream-process files to storage and stdout.
  2. You can now use the spider_utils crate to extract content manually as needed with generic CSS element scraping:
use spider::{
    hashbrown::{HashMap, HashSet},
    packages::scraper::Selector,
};
use spider_utils::{build_selectors, css_query_select_map_streamed};

async fn css_query_selector_extract() {
    let map = HashMap::from([(
        "list",
        HashSet::from([".list", ".sub-list"]),
    )]);
    let data = css_query_select_map_streamed(
        r#"<html>
            <body>
                <ul class="list"><li>First</li></ul>
                <ul class="sub-list"><li>Second</li></ul>
            </body>
        </html>"#,
        &build_selectors(map),
    )
    .await;
    println!("{:?}", data);
    // {"list": ["First", "Second"]}
}

Thanks for the help @gjtorikian

Full Changelog: v1.99.5...v1.99.10

v1.99.5

13 Jul 00:12

What's Changed

Use website.with_execution_scripts with a hashmap of exact paths and the JS to run on each page before the result is returned.
This can help when writing one-off code to unblock crawling a path, such as a login page (see the sketch below).
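
A minimal sketch; the path, the script, and the Option<HashMap<..>> argument shape are assumptions mirroring the with_automation_scripts example in v1.99.30:

use spider::hashbrown::HashMap;
use spider::website::Website;

#[tokio::main]
async fn main() {
    let mut execution_scripts = HashMap::new();

    // hypothetical one-off script: run on this exact path before the page result returns
    execution_scripts.insert(
        "/login".into(),
        r#"document.querySelector("#password").value = "demo";"#.into(),
    );

    let mut website: Website = Website::new("https://example.com")
        .with_execution_scripts(Some(execution_scripts))
        .build()
        .unwrap();

    website.crawl().await;
}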

  • Fix chrome wait_for selector and event handling
  • Major perf improvement when using the regex feature flag and robots.txt handling.

Full Changelog: v1.98.8...v1.99.5