Releases: spider-rs/spider
v1.19.41
What's Changed
- perf(crawl): add join handle task management by @j-mendez in #99
- chore(crawl): fix task shutdown on termination
- perf(crawl): reduce memory allocation across crawls
- chore(crawl): fix gathering timeout duration from robots config
--
Major speed and memory improvements for large crawls
Full Changelog: v1.19.26...v1.19.41
v1.19.26
What's Changed
- perf(links): add fast pre-serialized URL anchor link extraction and reduce memory usage
- perf(links): fix case sensitivity handling
- perf(crawl): reduce memory usage on link gathering
- chore(crawl): remove the Website.reset method and improve crawl resource usage (reset is no longer needed)
- chore(crawl): add heap usage tracking for links visited
- perf(crawl): allow massive scans to utilize more CPU
- feat(timeout): add optional configuration.request_timeout duration (see the sketch after this list)
- build(tokio): remove unused net feature
- chore(docs): add missing scrape section
- perf(crawl): add compact_str to reduce memory usage by 2x
- perf(scraper): add ahash impl to the default scraper fork
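The new request timeout is set on the crawl configuration. A minimal sketch, assuming configuration.request_timeout accepts an optional standard Duration (the exact wrapper type may differ between versions, so .into() is used for the conversion):

use spider::tokio;
use spider::website::Website;
use std::time::Duration;

#[tokio::main]
async fn main() {
    let mut website: Website = Website::new("https://choosealicense.com");
    // assumed shape: an optional per-request timeout on the configuration;
    // .into() covers either a plain or boxed Duration depending on the version
    website.configuration.request_timeout = Some(Duration::from_secs(15).into());
    website.crawl().await;
}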
Full Changelog: v1.18.15...v1.19.26
v1.18.15
What's Changed
Major
- fix stream throttling/delay
- perf(selectors): add top level selector building
- fix case insensitive link capturing
- add inline trap detection
- improve subdomain and TLD crawl performance (see the sketch below)
Minor
- remove extra string comparison conversions beforehand
- fix unwrap_or with default evaluations
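The subdomain and TLD improvements apply when those crawl modes are turned on. A minimal sketch, assuming boolean subdomains and tld fields on the configuration (the field names are an assumption; they are not listed in these notes):

use spider::tokio;
use spider::website::Website;

#[tokio::main]
async fn main() {
    let mut website: Website = Website::new("https://choosealicense.com");
    // assumed flags: also crawl subdomains and other TLDs of the root domain
    website.configuration.subdomains = true;
    website.configuration.tld = true;
    website.crawl().await;
}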
v1.7.5
v1.17.0
What's Changed
- feat(controls): add pause, resume, and shutdown crawler
Pause/Resume active crawls
use spider::tokio;
use spider::utils::{pause, resume};
use spider::website::Website;
use std::time::Duration;

#[tokio::main]
async fn main() {
    let url = "https://choosealicense.com/";
    let mut website: Website = Website::new(url);

    tokio::spawn(async move {
        // pause the active crawl, wait five seconds, then resume it
        pause(url).await;
        tokio::time::sleep(Duration::from_millis(5000)).await;
        resume(url).await;
    });

    website.crawl().await;
}
Shutdown crawls
use spider::tokio;
use spider::utils::shutdown;
use spider::website::Website;
use std::time::Duration;

#[tokio::main]
async fn main() {
    let url = "https://choosealicense.com/";
    let mut website: Website = Website::new(url);

    tokio::spawn(async move {
        // force the long-running crawl to shut down after 30 seconds
        tokio::time::sleep(Duration::from_secs(30)).await;
        shutdown(url).await;
    });

    website.crawl().await;
}
The examples above show pausing, resuming, and shutting down active crawls.
v1.15.0
Fast Mode
- improve crawl performance by 20-40+% across the majority of cases
- reduce crawl memory usage
Full Changelog: v1.14.2...v1.15.0
v1.14.2
v1.13.0
v1.12.1
What's Changed
- feat(runtime): add async crawl handling by @j-mendez in #71
- removal of website.configuration.concurrency
extern crate spider;

use spider::website::Website;
use spider::tokio;

#[tokio::main]
async fn main() {
    let mut website: Website = Website::new("https://choosealicense.com");
    website.crawl().await;
}
Performance
Removing rayon and using an async runtime provides drastic performance increases for IO work. Performance increased by over 33-70%.
Full Changelog: v1.11.0...v1.12.1
v1.11.0
- fix user-agent memory leak between calls on the configuration (see the sketch after this list)
- fix user-agent ownership in the HTTP client
- fix robot-parser memory leaks
- add shared connection between the robots parser and the client
- perf(req): enable brotli
- chore(statics): add initial static media ignore list
- chore: remove unsafe macro
- chore(tls): add ALPN tls defaults
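The user-agent fixes matter when a custom agent is kept on the configuration across crawls. A minimal sketch, assuming a configuration.user_agent field settable from a string slice (the field's exact type has changed across versions, so .into() is used):

use spider::tokio;
use spider::website::Website;

#[tokio::main]
async fn main() {
    let mut website: Website = Website::new("https://choosealicense.com");
    // assumed field: a custom user agent stored on the configuration
    website.configuration.user_agent = "my-crawler/1.0".into();
    website.crawl().await;
}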
Crawl performance improved from 1.9s to 1.5s over the previous benchmarks.
Full Changelog: v1.10.0...v1.11.0