Releases: spider-rs/spider

v1.19.41

24 Feb 23:26

What's Changed

  • perf(crawl): add join handle task management by @j-mendez in #99 (see the sketch after this list)
  • chore(crawl): fix task shutdown on termination
  • perf(crawl): reduce memory allocation across crawls
  • chore(crawl): fix gathering timeout duration from robots config
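
The join-handle management referenced above, as a minimal illustrative sketch with plain tokio tasks (not spider's internal code): keeping each spawned task's JoinHandle lets the crawler await every task to completion and cancel cleanly on termination.

extern crate spider;

use spider::tokio;
use spider::tokio::task::JoinHandle;

#[tokio::main]
async fn main() {
    // Illustrative only: track the handle of every spawned task.
    let mut handles: Vec<JoinHandle<()>> = Vec::new();

    for page in 0..4 {
        handles.push(tokio::spawn(async move {
            // ... fetch and process one page here ...
            println!("processed page {}", page);
        }));
    }

    // Await every handle so no task is left dangling; on termination,
    // `handle.abort()` would cancel a task early instead.
    for handle in handles {
        let _ = handle.await;
    }
}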

--
Major speed and memory improvements on large crawls

Full Changelog: v1.19.26...v1.19.41

v1.19.26

19 Feb 17:55

What's Changed

  1. perf(links): add fast pre-serialized URL anchor link extraction and reduce memory usage
  2. perf(links): fix case sensitivity handling
  3. perf(crawl): reduce memory usage on link gathering
  4. chore(crawl): remove the Website.reset method and improve crawl resource usage (reset is no longer needed)
  5. chore(crawl): add heap usage of links visited
  6. perf(crawl): allow massive scans to utilize more CPU
  7. feat(timeout): add optional configuration.request_timeout duration (see the sketch after this list)
  8. build(tokio): remove unused net feature
  9. chore(docs): add missing scrape section
  10. perf(crawl): add compact_str to reduce memory usage roughly 2x
  11. perf(scraper): add ahash implementation to the default scraper fork
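
A minimal sketch of the new optional request timeout from item 7, assuming configuration.request_timeout holds an optional boxed Duration (the exact field type may differ; check the crate docs):

extern crate spider;

use spider::website::Website;
use spider::tokio;
use std::time::Duration;

#[tokio::main]
async fn main() {
    let mut website: Website = Website::new("https://choosealicense.com");
    // Assumed field shape: abandon any request that takes longer than 15s.
    website.configuration.request_timeout = Some(Box::new(Duration::from_secs(15)));
    website.crawl().await;
}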

Full Changelog: v1.18.15...v1.19.26

v1.18.15

04 Feb 20:00

What's Changed

Major

  1. fix stream throttling/delay
  2. perf(selectors): add top level selector building
  3. fix case insensitive link capturing
  4. add inline trap detection
  5. subdomain and tld crawl performance increase (see the configuration sketch after these lists)

Minor

  1. remove extra string-compare conversions beforehand
  2. fix unwrap_or default evaluations
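
A minimal sketch of the subdomain and TLD crawling from the Major list, assuming boolean subdomains and tld flags on the configuration (field names inferred from the release notes):

extern crate spider;

use spider::website::Website;
use spider::tokio;

#[tokio::main]
async fn main() {
    let mut website: Website = Website::new("https://choosealicense.com");
    // Assumed flags: also follow subdomain and sibling-TLD links.
    website.configuration.subdomains = true;
    website.configuration.tld = true;
    website.crawl().await;
}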

v1.7.5

10 Jan 14:07

What's Changed

  • add parsing of mismatched transports for full site indexing (4cc931c)

Full Changelog: v1.17.0...v1.7.5

v1.17.0

24 Sep 12:41

What's Changed

  • feat(controls): add pause, resume, and shutdown crawler

Pause/Resume active crawls

use spider::website::Website;
use spider::utils::{pause, resume};
use spider::tokio;
use spider::tokio::time::sleep;
use std::time::Duration;

#[tokio::main]
async fn main() {
    let url = "https://choosealicense.com/";
    let mut website: Website = Website::new(url);

    tokio::spawn(async move {
        // pause the crawl, wait five seconds, then resume it
        pause(url).await;
        sleep(Duration::from_millis(5000)).await;
        resume(url).await;
    });

    website.crawl().await;
}

Shutdown crawls

use spider::website::Website;
use spider::utils::shutdown;
use spider::tokio;
use spider::tokio::time::sleep;
use std::time::Duration;

#[tokio::main]
async fn main() {
    let url = "https://choosealicense.com/";
    let mut website: Website = Website::new(url);

    tokio::spawn(async move {
        // force a shutdown if the crawl runs longer than 30 seconds
        sleep(Duration::from_secs(30)).await;
        shutdown(url).await;
    });

    website.crawl().await;
}

The examples above show pausing, resuming, and shutting down crawlers.

v1.15.0

20 Sep 15:18

Fast Mode

  • improve crawl performance by 20-40+% across the majority of cases
  • reduce crawl memory usage

Full Changelog: v1.14.2...v1.15.0

v1.14.2

06 Sep 13:26

What's included

  • [feat]: crawl stream handling
  • [perf]: add yield points per cycle

  • Crawl stream handling leverages the tokio async tick counter to yield between cycles, so large website scans run smoother (see the sketch below).
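
A minimal illustrative sketch of the yield-point pattern (plain tokio, not spider's internal code): yielding back to the scheduler every few iterations keeps one large crawl from starving other tasks on the runtime.

extern crate spider;

use spider::tokio;
use spider::tokio::task::yield_now;

#[tokio::main]
async fn main() {
    let links: Vec<String> = (0..1000).map(|i| format!("https://example.com/{}", i)).collect();

    for (i, link) in links.iter().enumerate() {
        // ... fetch and process `link` here ...
        let _ = link;
        if i % 64 == 0 {
            // cooperative yield point once per cycle
            yield_now().await;
        }
    }
}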

v1.13.0

30 Aug 14:16

What's Changed

  • feat(mem): add jemalloc opt memory backend by @j-mendez in #73

-- Notes

Drastic performance boost of around 35-70%, depending on OS.
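
Since the backend is opt-in, enabling it presumably goes through a Cargo feature flag; a hypothetical Cargo.toml snippet (the exact feature name may differ; check the crate docs):

# Hypothetical feature name inferred from the release title.
[dependencies]
spider = { version = "1.13", features = ["jemalloc"] }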

Full Changelog: v1.12.1...v1.13.0

v1.12.1

28 Aug 21:16

What's Changed

  • feat(runtime): add async crawl handling by @j-mendez in #71
  • removal of website.configuration.concurrency

Basic usage with the new async runtime:
extern crate spider;

use spider::website::Website;
use spider::tokio;

#[tokio::main]
async fn main() {
    let mut website: Website = Website::new("https://choosealicense.com");
    website.crawl().await;
}

Performance

Removing rayon and using an async runtime provides drastic performance increases for I/O-bound work.

Performance increased by 33-70%.

Full Changelog: v1.11.0...v1.12.1

v1.11.0

26 Jul 12:08
What's Changed

  • fix user-agent memory leak between calls on configuration
  • fix user-agent ownership http client
  • fix robot-parser memory leaks
  • add shared connection between the robot parser and the crawl client
  • perf(req): enable brotli
  • chore(statics): add initial static media ignore - remove unsafe macro
  • chore(tls): add ALPN tls defaults

Crawl times improved from 1.9s to 1.5s compared to previous benchmarks.

Full Changelog: v1.10.0...v1.11.0