feat(disk): add hybrid disk storing links
j-mendez committed Dec 24, 2024
1 parent ffab413 commit 1884ed4
Showing 42 changed files with 1,451 additions and 157 deletions.
664 changes: 654 additions & 10 deletions Cargo.lock

Large diffs are not rendered by default.

2 changes: 1 addition & 1 deletion examples/advanced_configuration.rs
@@ -42,7 +42,7 @@ async fn main() -> Result<(), Error> {
     website.crawl().await;
     let duration = start.elapsed();
 
-    let links = website.get_links();
+    let links = website.get_all_links_visited().await;
 
     for link in links.iter() {
         println!("- {:?}", link.as_ref());
2 changes: 1 addition & 1 deletion examples/budget.rs
@@ -22,7 +22,7 @@ async fn main() -> Result<(), Error> {
     website.crawl().await;
     let duration = start.elapsed();
 
-    let links = website.get_links();
+    let links = website.get_all_links_visited().await;
 
    for link in links.iter() {
        println!("- {:?}", link.as_ref());
2 changes: 1 addition & 1 deletion examples/chrome.rs
@@ -27,7 +27,7 @@ async fn crawl_website(url: &str) -> Result<()> {
 
     let duration = start.elapsed();
 
-    let links = website.get_links();
+    let links = website.get_all_links_visited().await;
 
     println!(
         "Time elapsed in website.crawl() is: {:?} for total pages: {:?}",
2 changes: 1 addition & 1 deletion examples/chrome_remote.rs
@@ -42,7 +42,7 @@ async fn crawl_website(url: &str) -> Result<()> {
 
     let duration = start.elapsed();
 
-    let links = website.get_links();
+    let links = website.get_all_links_visited().await;
 
     println!(
         "Time elapsed in website.crawl({}) is: {:?} for total pages: {:?}",
2 changes: 1 addition & 1 deletion examples/chrome_screenshot.rs
@@ -39,7 +39,7 @@ async fn main() {
     website.crawl().await;
     let duration = start.elapsed();
 
-    let links = website.get_links();
+    let links = website.get_all_links_visited().await;
 
     println!(
         "Time elapsed in website.crawl() is: {:?} for total pages: {:?}",
2 changes: 1 addition & 1 deletion examples/chrome_screenshot_with_config.rs
@@ -32,7 +32,7 @@ async fn main() {
     let start = crate::tokio::time::Instant::now();
     website.crawl().await;
     let duration = start.elapsed();
-    let links = website.get_links();
+    let links = website.get_all_links_visited().await;
 
     println!(
         "Time elapsed in website.crawl() is: {:?} for total pages: {:?}",
2 changes: 1 addition & 1 deletion examples/chrome_viewport.rs
@@ -23,7 +23,7 @@ async fn main() {
     website.crawl().await;
     let duration = start.elapsed();
 
-    let links = website.get_links();
+    let links = website.get_all_links_visited().await;
 
     for link in links.iter() {
         println!("- {:?}", link.as_ref());
2 changes: 1 addition & 1 deletion examples/chrome_web_automation.rs
@@ -48,7 +48,7 @@ async fn main() {
     website.crawl().await;
     let duration = start.elapsed();
 
-    let links = website.get_links();
+    let links = website.get_all_links_visited().await;
 
     for link in links.iter() {
         println!("- {:?}", link.as_ref());
2 changes: 1 addition & 1 deletion examples/configuration.rs
@@ -25,7 +25,7 @@ async fn main() -> Result<(), Error> {
     website.crawl().await;
     let duration = start.elapsed();
 
-    let links = website.get_links();
+    let links = website.get_all_links_visited().await;
 
     for link in links.iter() {
         println!("- {:?}", link.as_ref());
2 changes: 1 addition & 1 deletion examples/css_scrape.rs
@@ -39,7 +39,7 @@ async fn main() {
             format!(
                 "Time elapsed in website.crawl() is: {:?} for total pages: {:?}",
                 duration,
-                website.get_links().len()
+                website.get_size().await
             )
             .as_bytes(),
         )
2 changes: 1 addition & 1 deletion examples/depth.rs
@@ -17,7 +17,7 @@ async fn main() -> Result<(), Error> {
     website.crawl().await;
     let duration: std::time::Duration = start.elapsed();
 
-    let links = website.get_links();
+    let links = website.get_all_links_visited().await;
 
     for link in links.iter() {
         println!("- {:?}", link.as_ref());
7 changes: 2 additions & 5 deletions examples/download.rs
@@ -46,11 +46,8 @@ async fn main() {
         .open(&download_file)
         .expect("Unable to open file");
 
-    match page.get_bytes() {
-        Some(b) => {
-            file.write_all(b).unwrap_or_default();
-        }
-        _ => (),
+    if let Some(b) = page.get_bytes() {
+        file.write_all(b).unwrap_or_default();
     }
 
     log("downloaded", download_file)
2 changes: 1 addition & 1 deletion examples/example.rs
@@ -22,7 +22,7 @@ async fn main() {
     website.crawl().await;
     let duration = start.elapsed();
 
-    let links = website.get_links();
+    let links = website.get_all_links_visited().await;
 
     for link in links.iter() {
         println!("- {:?}", link.as_ref());
2 changes: 1 addition & 1 deletion examples/loop.rs
@@ -55,7 +55,7 @@ async fn main() {
             format!(
                 "Time elapsed in website.crawl() is: {:?} for total pages: {:?}\n",
                 duration,
-                website.get_links().len()
+                website.get_size().await
             )
             .as_bytes(),
         )
2 changes: 1 addition & 1 deletion examples/openai.rs
@@ -58,7 +58,7 @@ async fn main() {
     let start = crate::tokio::time::Instant::now();
     website.crawl().await;
     let duration = start.elapsed();
-    let links = website.get_links();
+    let links = website.get_all_links_visited().await;
 
     println!(
         "Time elapsed in website.crawl() is: {:?} for total pages: {:?}",
2 changes: 1 addition & 1 deletion examples/openai_cache.rs
@@ -63,7 +63,7 @@ async fn main() {
 
     let duration = start.elapsed();
 
-    let links = website.get_links();
+    let links = website.get_all_links_visited().await;
 
     println!(
         "Time elapsed in website.crawl() is: {:?} for total pages: {:?}",
2 changes: 1 addition & 1 deletion examples/openai_extra.rs
@@ -44,7 +44,7 @@ async fn main() {
     let start = crate::tokio::time::Instant::now();
     website.crawl().await;
     let duration = start.elapsed();
-    let links = website.get_links();
+    let links = website.get_all_links_visited().await;
 
     println!(
         "Time elapsed in website.crawl() is: {:?} for total pages: {:?}",
2 changes: 1 addition & 1 deletion examples/openai_multi.rs
@@ -45,7 +45,7 @@ async fn main() {
     let start = crate::tokio::time::Instant::now();
     website.crawl().await;
     let duration = start.elapsed();
-    let links = website.get_links();
+    let links = website.get_all_links_visited().await;
 
     println!(
         "Time elapsed in website.crawl() is: {:?} for total pages: {:?}",
2 changes: 1 addition & 1 deletion examples/queue.rs
@@ -43,6 +43,6 @@ async fn main() {
     println!(
         "Time elapsed in website.crawl() is: {:?} for total pages: {:?}",
         duration,
-        website.get_links().len()
+        website.get_size().await
     )
 }
2 changes: 1 addition & 1 deletion examples/real_world.rs
@@ -46,7 +46,7 @@ async fn crawl_website(url: &str) -> Result<()> {
         async move {
             website.crawl().await;
             website.unsubscribe();
-            website.get_links()
+            website.get_all_links_visited().await
         },
         async move {
             while let Ok(page) = rx2.recv().await {
2 changes: 1 addition & 1 deletion examples/serde.rs
@@ -12,7 +12,7 @@ async fn main() {
 
     website.crawl().await;
 
-    let links = website.get_links();
+    let links = website.get_all_links_visited().await;
 
     let mut s = flexbuffers::FlexbufferSerializer::new();
 
2 changes: 1 addition & 1 deletion examples/sitemap.rs
@@ -27,7 +27,7 @@ async fn main() {
 
     let duration = start.elapsed();
 
-    let links = website.get_links();
+    let links = website.get_all_links_visited().await;
 
     for link in links.iter() {
         println!("- {:?}", link.as_ref());
2 changes: 1 addition & 1 deletion examples/smart.rs
@@ -17,5 +17,5 @@ async fn main() {
 
     website.crawl_smart().await;
 
-    println!("Links found {:?}", website.get_links().len());
+    println!("Links found {:?}", website.get_size().await);
 }
2 changes: 1 addition & 1 deletion examples/subscribe.rs
@@ -32,7 +32,7 @@ async fn main() {
             format!(
                 "Time elapsed in website.crawl() is: {:?} for total pages: {:?}",
                 duration,
-                website.get_links().len()
+                website.get_size().await
             )
             .as_bytes(),
         )
7 changes: 2 additions & 5 deletions examples/subscribe_download.rs
@@ -48,11 +48,8 @@ async fn main() {
             .await
             .expect("Unable to open file");
 
-        match page.get_bytes() {
-            Some(b) => {
-                file.write_all(b).await.unwrap_or_default();
-            }
-            _ => (),
+        if let Some(b) = page.get_bytes() {
+            file.write_all(b).await.unwrap_or_default();
         }
 
         log("downloaded", download_file)
2 changes: 1 addition & 1 deletion examples/transform_markdown.rs
@@ -40,7 +40,7 @@ async fn main() {
             format!(
                 "Time elapsed in website.crawl() is: {:?} for total pages: {:?}",
                 duration,
-                website.get_links().len()
+                website.get_size().await
             )
             .as_bytes(),
         )
2 changes: 1 addition & 1 deletion examples/url_glob_subdomains.rs
@@ -16,7 +16,7 @@ async fn main() {
     website.crawl().await;
     let duration = start.elapsed();
 
-    let links = website.get_links();
+    let links = website.get_all_links_visited().await;
 
     for link in links.iter() {
         println!("- {:?}", link.as_ref());
8 changes: 6 additions & 2 deletions spider/Cargo.toml
@@ -1,6 +1,6 @@
 [package]
 name = "spider"
-version = "2.21.33"
+version = "2.22.2"
 authors = [
     "j-mendez <jeff@spider.cloud>"
 ]
@@ -70,6 +70,7 @@ statrs = { version = "0.17", optional = true }
 aho-corasick = { version = "1", optional = true }
 tracing = { version = "0.1", default-features = false, features = ["std"], optional = true }
 sysinfo = { version = "0.33", default-features = false, features = ["system"], optional = true }
+sqlx = { version = "0.8", features = [ "runtime-tokio", "sqlite" ], optional = true }
 
 [dependencies.spider_chrome]
 version = "2"
@@ -113,7 +114,10 @@ reqwest = { version = "0.12", features = [
 ] }
 
 [features]
-default = ["sync", "reqwest_native_tls_native_roots", "cookies", "ua_generator", "encoding", "string_interner_buffer_backend", "balance"]
+default = ["sync", "reqwest_native_tls_native_roots", "disk_native_tls", "cookies", "ua_generator", "encoding", "string_interner_buffer_backend", "balance"]
+disk = ["dep:sqlx"]
+disk_native_tls = ["disk", "sqlx/runtime-tokio-native-tls"]
+disk_aws = ["disk", "sqlx/tls-rustls-aws-lc-rs"]
 adblock = ["chrome", "spider_chrome/adblock"]
 balance = ["dep:sysinfo"]
 regex = []
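Given the feature wiring above, a downstream crate can pick a TLS backend for the disk store or opt out of it entirely. A hypothetical consumer manifest (the version pin and opt-out feature list are illustrative; the feature names match this diff):

```toml
[dependencies]
# `disk_native_tls` is now in spider's default feature set; naming it
# explicitly documents the intent. Swap in `disk_aws` for the
# rustls/aws-lc-rs TLS backend instead.
spider = { version = "2.22", features = ["disk_native_tls"] }
```

Because `disk_native_tls` was added to `default`, existing users pull in `sqlx`/SQLite on upgrade unless they build with `default-features = false`.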
6 changes: 5 additions & 1 deletion spider/README.md
@@ -97,7 +97,8 @@ spider = { version = "2", features = ["regex", "ua_generator"] }
 ```
 
 1. `ua_generator`: Enables auto generating a random real User-Agent.
-1. `regex`: Enables blacklisting paths with regx
+1. `regex`: Enables blacklisting and whitelisting paths with regex.
+1. `disk`: Enables SQLite hybrid disk storage to balance memory usage.
 1. `jemalloc`: Enables the [jemalloc](https://github.com/jemalloc/jemalloc) memory backend.
 1. `decentralized`: Enables decentralized processing of IO, requires the [spider_worker](../spider_worker/README.md) startup before crawls.
 1. `sync`: Subscribe to changes for Page data processing async. [Enabled by default]
@@ -132,6 +133,9 @@ spider = { version = "2", features = ["regex", "ua_generator"] }
 1. `headers`: Enables the extraction of header information on each retrieved page. Adds a `headers` field to the page struct.
 1. `decentralized_headers`: Enables the extraction of suppressed header information of the decentralized processing of IO.
    This is needed if `headers` is set in both [spider](../spider/README.md) and [spider_worker](../spider_worker/README.md).
+1. `string_interner_buffer_backend`: Enables the String interning using the buffer backend [default].
+1. `string_interner_string_backend`: Enables the String interning using the string backend.
+1. `string_interner_bucket_backend`: Enables the String interning using the bucket backend.
 
 ### Decentralization
 
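The headline change, hybrid disk storage for visited links, caps in-memory state and overflows to disk once a crawl grows large, which is why `sqlx` with SQLite enters the dependency tree and why the link accessors in the example diffs became `async`. As a loose, self-contained sketch of the idea only (a flat spill file stands in for spider's actual SQLite store, and every name below is hypothetical, not spider's API):

```rust
use std::collections::HashSet;
use std::fs::{File, OpenOptions};
use std::io::{BufRead, BufReader, Write};
use std::path::PathBuf;

/// Toy hybrid link store: links are held in memory up to a budget, then
/// spilled to an append-only file. Spider persists to SQLite via `sqlx`;
/// this only illustrates the memory/disk split, not the real implementation.
pub struct HybridLinkStore {
    memory: HashSet<String>,
    capacity: usize,
    spill_path: PathBuf,
}

impl HybridLinkStore {
    pub fn new(capacity: usize, spill_path: PathBuf) -> Self {
        Self { memory: HashSet::new(), capacity, spill_path }
    }

    /// Record a link. Returns Ok(true) if it was new, Ok(false) if seen before.
    pub fn insert(&mut self, link: &str) -> std::io::Result<bool> {
        if self.contains(link)? {
            return Ok(false);
        }
        if self.memory.len() < self.capacity {
            self.memory.insert(link.to_owned());
        } else {
            // Memory budget exhausted: persist to disk instead of growing the set.
            let mut file = OpenOptions::new()
                .create(true)
                .append(true)
                .open(&self.spill_path)?;
            writeln!(file, "{link}")?;
        }
        Ok(true)
    }

    /// Membership check: memory first, then a scan of the spill file.
    /// (A SQLite index turns this linear scan into a cheap lookup.)
    pub fn contains(&self, link: &str) -> std::io::Result<bool> {
        if self.memory.contains(link) {
            return Ok(true);
        }
        match File::open(&self.spill_path) {
            Ok(f) => Ok(BufReader::new(f)
                .lines()
                .filter_map(Result::ok)
                .any(|l| l == link)),
            Err(_) => Ok(false), // no spill file yet means nothing on disk
        }
    }

    /// Total links seen across memory and disk, akin to `get_size` in the diffs.
    pub fn len(&self) -> std::io::Result<usize> {
        let on_disk = match File::open(&self.spill_path) {
            Ok(f) => BufReader::new(f).lines().count(),
            Err(_) => 0,
        };
        Ok(self.memory.len() + on_disk)
    }
}
```

This split is also why calls such as `get_all_links_visited().await` and `get_size().await` are now async throughout the examples: a read may touch disk, not just memory.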