Releases: spider-rs/spider

v1.19.41

24 Feb 23:26

What's Changed

  • perf(crawl): add join handle task management by @j-mendez in #99 (see the sketch after this list)
  • chore(crawl): fix task shutdown on termination
  • perf(crawl): reduce memory allocation across crawls
  • chore(crawl): fix gathering timeout duration from robots config
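
The join-handle management referenced above, as a minimal illustrative sketch with plain tokio tasks (not spider's internal code): keeping each spawned task's JoinHandle lets the crawler await every task to completion and cancel cleanly on termination.

extern crate spider;

use spider::tokio;
use spider::tokio::task::JoinHandle;

#[tokio::main]
async fn main() {
    // Illustrative only: track the handle of every spawned task.
    let mut handles: Vec<JoinHandle<()>> = Vec::new();

    for page in 0..4 {
        handles.push(tokio::spawn(async move {
            // ... fetch and process one page here ...
            println!("processed page {}", page);
        }));
    }

    // Await every handle so no task is left dangling; on termination,
    // `handle.abort()` would cancel a task early instead.
    for handle in handles {
        let _ = handle.await;
    }
}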

--
Major speed and memory improvements on large crawls

Full Changelog: v1.19.26...v1.19.41

v1.19.26

19 Feb 17:55

What's Changed

  1. perf(links): add fast pre-serialized URL anchor link extraction and reduce memory usage
  2. perf(links): fix case sensitivity handling
  3. perf(crawl): reduce memory usage on link gathering
  4. chore(crawl): remove the Website.reset method and improve crawl resource usage (reset is no longer needed)
  5. chore(crawl): add heap usage of links visited
  6. perf(crawl): allow massive scans to utilize more CPU
  7. feat(timeout): add optional configuration.request_timeout duration (see the sketch after this list)
  8. build(tokio): remove unused net feature
  9. chore(docs): add missing scrape section
  10. perf(crawl): add compact_str to reduce memory usage roughly 2x
  11. perf(scraper): add ahash implementation to the default scraper fork
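
A minimal sketch of the new optional request timeout from item 7, assuming configuration.request_timeout holds an optional boxed Duration (the exact field type may differ; check the crate docs):

extern crate spider;

use spider::website::Website;
use spider::tokio;
use std::time::Duration;

#[tokio::main]
async fn main() {
    let mut website: Website = Website::new("https://choosealicense.com");
    // Assumed field shape: abandon any request that takes longer than 15s.
    website.configuration.request_timeout = Some(Box::new(Duration::from_secs(15)));
    website.crawl().await;
}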

Full Changelog: v1.18.15...v1.19.26

v1.18.15

04 Feb 20:00

What's Changed

Major

  1. fix stream throttling/delay
  2. perf(selectors): add top level selector building
  3. fix case insensitive link capturing
  4. add inline trap detection
  5. subdomain and tld crawl performance increase (see the configuration sketch after these lists)

Minor

  1. remove extra string-compare conversions beforehand
  2. fix unwrap_or default evaluations
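
A minimal sketch of the subdomain and TLD crawling from the Major list, assuming boolean subdomains and tld flags on the configuration (field names inferred from the release notes):

extern crate spider;

use spider::website::Website;
use spider::tokio;

#[tokio::main]
async fn main() {
    let mut website: Website = Website::new("https://choosealicense.com");
    // Assumed flags: also follow subdomain and sibling-TLD links.
    website.configuration.subdomains = true;
    website.configuration.tld = true;
    website.crawl().await;
}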

v1.7.5

10 Jan 14:07

What's Changed

  • add parsing of mismatched transports for full site indexing (4cc931c)

Full Changelog: v1.17.0...v1.7.5

v1.17.0

24 Sep 12:41

What's Changed

  • feat(controls): add pause, resume, and shutdown crawler

Pause/Resume active crawls

use spider::website::Website;
use spider::utils::{pause, resume};
use spider::tokio;
use spider::tokio::time::sleep;
use std::time::Duration;

#[tokio::main]
async fn main() {
    let url = "https://choosealicense.com/";
    let mut website: Website = Website::new(url);

    tokio::spawn(async move {
        // pause the crawl, wait five seconds, then resume it
        pause(url).await;
        sleep(Duration::from_millis(5000)).await;
        resume(url).await;
    });

    website.crawl().await;
}

Shutdown crawls

use spider::website::Website;
use spider::utils::shutdown;
use spider::tokio;
use spider::tokio::time::sleep;
use std::time::Duration;

#[tokio::main]
async fn main() {
    let url = "https://choosealicense.com/";
    let mut website: Website = Website::new(url);

    tokio::spawn(async move {
        // force a shutdown if the crawl runs longer than 30 seconds
        sleep(Duration::from_secs(30)).await;
        shutdown(url).await;
    });

    website.crawl().await;
}

The examples above show pausing, resuming, and shutting down crawlers.

v1.15.0

20 Sep 15:18

Fast Mode

  • improve crawl performance by 20-40+% across the majority of cases
  • reduce crawl memory usage

Full Changelog: v1.14.2...v1.15.0

v1.14.2

06 Sep 13:26

What's included

  • [feat]: crawl stream handling
  • [perf]: add yield points per cycle

  • Crawl stream handling leverages the tokio async tick counter to yield between cycles, so large website scans run smoother (see the sketch below).
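
A minimal illustrative sketch of the yield-point pattern (plain tokio, not spider's internal code): yielding back to the scheduler every few iterations keeps one large crawl from starving other tasks on the runtime.

extern crate spider;

use spider::tokio;
use spider::tokio::task::yield_now;

#[tokio::main]
async fn main() {
    let links: Vec<String> = (0..1000).map(|i| format!("https://example.com/{}", i)).collect();

    for (i, link) in links.iter().enumerate() {
        // ... fetch and process `link` here ...
        let _ = link;
        if i % 64 == 0 {
            // cooperative yield point once per cycle
            yield_now().await;
        }
    }
}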

v1.13.0

30 Aug 14:16

What's Changed

  • feat(mem): add jemalloc opt memory backend by @j-mendez in #73

-- Notes

Drastic performance boost of around 35-70%, depending on OS.
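
Since the backend is opt-in, enabling it presumably goes through a Cargo feature flag; a hypothetical Cargo.toml snippet (the exact feature name may differ; check the crate docs):

# Hypothetical feature name inferred from the release title.
[dependencies]
spider = { version = "1.13", features = ["jemalloc"] }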

Full Changelog: v1.12.1...v1.13.0

v1.12.1

28 Aug 21:16

What's Changed

  • feat(runtime): add async crawl handling by @j-mendez in #71
  • removal of website.configuration.concurrency

Basic usage with the new async runtime:
extern crate spider;

use spider::website::Website;
use spider::tokio;

#[tokio::main]
async fn main() {
    let mut website: Website = Website::new("https://choosealicense.com");
    website.crawl().await;
}

Performance

Removing rayon and using an async runtime provides drastic performance increases for I/O-bound work.

Performance increased by 33-70%.

Full Changelog: v1.11.0...v1.12.1

v1.11.0

26 Jul 12:08
What's Changed

  • fix user-agent memory leak between calls on configuration
  • fix user-agent ownership http client
  • fix robot-parser memory leaks
  • add shared connection between the robot parser and the crawl client
  • perf(req): enable brotli
  • chore(statics): add initial static media ignore - remove unsafe macro
  • chore(tls): add ALPN tls defaults

Crawl times improved from 1.9s to 1.5s compared to previous benchmarks.

Full Changelog: v1.10.0...v1.11.0