Releases: spider-rs/spider
v2.2.18
What's Changed
We can now auto-detect locales without losing performance. The encoding flag is now enabled by default as part of this change!
- get_html now properly encodes the HTML instead of defaulting to UTF-8 encoding
- bump chromiumoxide@0.7.0
- fix chrome hang on ws connections handler
- fix fetch stream infinite loop on error
- fix chrome frame setting URL (this temporarily prevents hybrid caching from having the req/res for the page)
```rust
let mut website: Website = Website::new("https://tenki.jp");
// all of the content output has the proper encoding automatically
```
Full Changelog: v2.1.9...v2.2.18
v2.1.9
What's New
This release brings bug fixes for Chrome opening pages that caused hangs. The builder method website.with_return_page_links can be used to attach the links found on a web page to the page object.
- chore(chrome): fix instances being left open from ignorable handler errors
- chore(scrape): add sitemap and smart [#206]
- feat(page): add return page links configuration
- chore(config): fix budget reset on crawl end
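As a sketch of the new page-links option, following the builder pattern used in the other snippets in these notes (the boolean argument to with_return_page_links is an assumption, not confirmed by this release note):

```rust
// Sketch: attach the links discovered on each page to the Page object.
// The exact signature of with_return_page_links is assumed here.
let mut website: Website = Website::new("https://example.com")
    .with_return_page_links(true)
    .build()
    .unwrap();
```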
Thanks @DimitriTimoz
Full Changelog: v2.0.6...v2.1.9
v2.0.6
What's Changed
- add http response cookies map
- fix chrome fs feature flag build
- Update README.md by @James4Ever0 in #203
New Contributors
- @James4Ever0 made their first contribution in #203
Full Changelog: v2.0.3...v2.0.6
v2.0.3
What's Changed
- Scrape and crawl now behave identically, as scrape re-uses crawl underneath.
- Scrape API cleanup
- Add get_chrome_page to get a reference to the Chrome page
Full Changelog: v1.99.30...v2.0.3
v1.99.30
What's Changed
- feat: web automation steps by target URL or path
- add internal ViewPort for Chrome handling
- add PartialEq to configuration
```rust
let mut automation_scripts = HashMap::new();

automation_scripts.insert(
    "/en/blog".into(),
    Vec::from([
        WebAutomation::Evaluate(r#"document.body.style.background = "blue";"#.into()),
        WebAutomation::ScrollY(2000),
        WebAutomation::Click("article a".into()),
        WebAutomation::Wait(5000),
        WebAutomation::Screenshot {
            output: "example.png".into(),
            full_page: true,
            omit_background: true,
        },
    ]),
);

let mut website: Website = Website::new("https://rsseau.fr/en/blog")
    .with_chrome_intercept(true, true)
    .with_wait_for_idle_network(Some(WaitForIdleNetwork::new(Some(Duration::from_secs(30)))))
    .with_caching(cfg!(feature = "cache"))
    .with_limit(1)
    .with_automation_scripts(Some(automation_scripts))
    .build()
    .unwrap();
```
Full Changelog: v1.99.21...v1.99.30
v1.99.21
What's Changed
You can now block ads over the network when using chrome and chrome_intercept using the adblock
feature flag.
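A minimal sketch of enabling this, assuming a project that depends on spider with the relevant Cargo features (the feature names mirror this release note; the builder call follows the v1.99.30 example, and the URL is a placeholder):

```rust
// Build with: cargo build --features "spider/chrome spider/chrome_intercept spider/adblock"
// With interception on, ad requests are blocked over the network.
let mut website: Website = Website::new("https://example.com")
    .with_chrome_intercept(true, true)
    .build()
    .unwrap();
```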
Full Changelog: v1.99.18...v1.99.21
v1.99.18
v1.99.16
What's Changed
- Fixing clap issues #195 by @jmikedupont2 in #196
- Fix chrome fingerprint and initial document scripts setup
- Perf improvements for smart mode handling assets with compile time constant map
New Contributors
- @jmikedupont2 made their first contribution in #196
Full Changelog: v1.99.10...v1.99.16
v1.99.10
What's Changed
- Major performance increase for CLI scraping and downloading. We now use subscriptions and concurrently stream-process files to storage and stdout.
- You can now use the spider_utils crate to extract content manually as needed using generic CSS element scraping.
```rust
use spider::{
    hashbrown::{HashMap, HashSet},
    packages::scraper::Selector,
};
use spider_utils::{build_selectors, css_query_select_map_streamed};

async fn css_query_selector_extract() {
    let map = HashMap::from([(
        "list",
        HashSet::from([".list", ".sub-list"]),
    )]);
    let data = css_query_select_map_streamed(
        r#"<html>
            <body>
                <ul class="list"><li>First</li></ul>
                <ul class="sub-list"><li>Second</li></ul>
            </body>
        </html>"#,
        &build_selectors(map),
    )
    .await;
    println!("{:?}", data);
    // {"list": ["First", "Second"]}
}
```
Thanks for the help @gjtorikian
Full Changelog: v1.99.5...v1.99.10
v1.99.5
What's Changed
Use website.with_execution_scripts with a hashmap of exact paths and JS to run on the page before returning the result. This can help when writing one-off code to unblock crawling a path such as a login page.
- Fix chrome wait_for selector and event handling
- Major perf improvement when using the regex feature flag and robots.txt handling.
Full Changelog: v1.98.8...v1.99.5