Refactoring of robotparser-rs #20

Merged: 3 commits on Jan 31, 2020
1 change: 1 addition & 0 deletions .gitignore
@@ -1,3 +1,4 @@
target
Cargo.lock
.vscode/
.idea/
19 changes: 14 additions & 5 deletions Cargo.toml
@@ -8,16 +8,25 @@ license = "MIT"
name = "robotparser"
readme = "README.md"
repository = "https://github.com/messense/robotparser-rs"
version = "0.10.2"
version = "0.11.0"
Collaborator: Usually I let the maintainer decide the next version release.

edition = "2018"

[dependencies]
url = "1"
url = "2"
percent-encoding = "2.1"

[dependencies.reqwest]
version = "0.9"
version = "0.10.1"
optional = true
features = ["blocking"]

[dependencies.futures]
version = "0.3"
optional = true

[features]
default = ["http"]
http = ["reqwest"]
default = ["reqwest", "futures"]
unstable = []

[dev-dependencies]
tokio = "0.2.11"
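The new default feature set pulls in both `reqwest` and `futures`. A downstream crate that only needs the parsing types could presumably opt out through Cargo's standard mechanism (a sketch, assuming the `0.11` feature layout shown in the diff above):

```toml
# Hypothetical consumer Cargo.toml: disable the default HTTP client support
# and use only the parser/model types.
[dependencies]
robotparser = { version = "0.11", default-features = false }
```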
17 changes: 10 additions & 7 deletions README.md
@@ -15,7 +15,7 @@ Add it to your ``Cargo.toml``:

```toml
[dependencies]
robotparser = "0.10"
robotparser = "0.11"
```

Add ``extern crate robotparser`` to your crate root and you're good to go!
@@ -24,14 +24,17 @@ Add ``extern crate robotparser`` to your crate root and you're good to go!
## Examples

```rust
extern crate robotparser;

use robotparser::RobotFileParser;
use robotparser::http::RobotsTxtClient;
use robotparser::service::RobotsTxtService;
use reqwest::Client;
use url::Url;

fn main() {
let parser = RobotFileParser::new("http://www.python.org/robots.txt");
parser.read();
assert!(parser.can_fetch("*", "http://www.python.org/robots.txt"));
let client = Client::new();
let robots_txt_url = Url::parse("http://www.python.org/robots.txt").unwrap();
let robots_txt = client.fetch_robots_txt(robots_txt_url.origin()).unwrap().get_result();
let fetch_url = Url::parse("http://www.python.org/robots.txt").unwrap();
assert!(robots_txt.can_fetch("*", &fetch_url));
}
```

19 changes: 19 additions & 0 deletions src/http.rs
@@ -0,0 +1,19 @@
//! # Supported libraries
//! To enable support for the required library, you need to add this feature to your `Cargo.toml`.
//! Currently only one library is supported: `reqwest`.
//! Support for other libraries can be added as well.

use url::Origin;
#[cfg(feature = "reqwest")]
/// Support for reqwest library.
pub mod reqwest;

/// User agent of this crate.
pub const DEFAULT_USER_AGENT: &str = "robotparser-rs (https://crates.io/crates/robotparser)";

/// Trait to fetch and parse the robots.txt file.
/// Must be implemented on http-client.
pub trait RobotsTxtClient {
type Result;
fn fetch_robots_txt(&self, origin: Origin) -> Self::Result;
}
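The trait above is deliberately small so that it can be implemented for HTTP clients other than `reqwest`. A minimal, std-only sketch of such an implementation follows; the trait is re-declared locally (with `Origin` simplified to `&str`) and the in-memory `MockClient` is hypothetical, so the sketch stands alone:

```rust
use std::collections::HashMap;

// Local re-declaration of the crate's trait shape, simplified so this
// sketch compiles without the `url` crate.
trait RobotsTxtClient {
    type Result;
    fn fetch_robots_txt(&self, origin: &str) -> Self::Result;
}

// Hypothetical client that serves canned robots.txt bodies from memory,
// e.g. for tests.
struct MockClient {
    bodies: HashMap<String, String>,
}

impl RobotsTxtClient for MockClient {
    // A fetch either yields the stored body or an error message.
    type Result = Result<String, String>;

    fn fetch_robots_txt(&self, origin: &str) -> Self::Result {
        self.bodies
            .get(origin)
            .cloned()
            .ok_or_else(|| format!("no robots.txt stored for {}", origin))
    }
}
```

The associated `Result` type is what lets the blocking implementation return a parse result directly while the async one returns a future.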
4 changes: 4 additions & 0 deletions src/http/reqwest.rs
@@ -0,0 +1,4 @@
mod sync_reqwest;
pub use self::sync_reqwest::*;
mod async_reqwest;
pub use self::async_reqwest::*;
76 changes: 76 additions & 0 deletions src/http/reqwest/async_reqwest.rs
@@ -0,0 +1,76 @@
use reqwest::{Client, Request};
use reqwest::{Method, Error};
use reqwest::header::HeaderValue;
use url::{Origin, Url};
use reqwest::header::USER_AGENT;
use crate::http::{RobotsTxtClient, DEFAULT_USER_AGENT};
use crate::parser::{ParseResult, parse_fetched_robots_txt};
use crate::model::FetchedRobotsTxt;
use std::pin::Pin;
use futures::task::{Context, Poll};
use futures::Future;
use futures::future::TryFutureExt;
use futures::future::ok as future_ok;

type FetchFuture = Box<dyn Future<Output=Result<(ResponseInfo, String), Error>>>;

impl RobotsTxtClient for Client {
type Result = RobotsTxtResponse;
fn fetch_robots_txt(&self, origin: Origin) -> Self::Result {
let url = format!("{}/robots.txt", origin.unicode_serialization());
let url = Url::parse(&url).expect("Unable to parse robots.txt url");
Collaborator: I know this is not currently tested, but maybe we can add a test for that. What do you think?

let mut request = Request::new(Method::GET, url);
let _ = request.headers_mut().insert(USER_AGENT, HeaderValue::from_static(DEFAULT_USER_AGENT));
let response = self
.execute(request)
.and_then(|response| {
let response_info = ResponseInfo {status_code: response.status().as_u16()};
return response.text().and_then(|response_text| {
return future_ok((response_info, response_text));
});
});
let response: Pin<Box<dyn Future<Output=Result<(ResponseInfo, String), Error>>>> = Box::pin(response);
return RobotsTxtResponse {
origin,
response,
}
}
}

struct ResponseInfo {
status_code: u16,
}

/// Future for fetching robots.txt result.
pub struct RobotsTxtResponse {
origin: Origin,
response: Pin<FetchFuture>,
}

impl RobotsTxtResponse {
/// Returns origin of robots.txt
pub fn get_origin(&self) -> &Origin {
return &self.origin;
}
}

impl Future for RobotsTxtResponse {
type Output = Result<ParseResult<FetchedRobotsTxt>, Error>;

fn poll(self: Pin<&mut Self>, cx: &mut Context) -> Poll<Self::Output> {
let self_mut = self.get_mut();
let response_pin = self_mut.response.as_mut();
match response_pin.poll(cx) {
Poll::Ready(Ok((response_info, text))) => {
let robots_txt = parse_fetched_robots_txt(self_mut.origin.clone(), response_info.status_code, &text);
return Poll::Ready(Ok(robots_txt));
},
Poll::Ready(Err(error)) => {
return Poll::Ready(Err(error));
},
Poll::Pending => {
return Poll::Pending;
},
}
}
}
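`RobotsTxtResponse` above owns a `Pin<Box<dyn Future>>` and forwards `poll` to it, mapping the ready value into a parse result. That pattern can be reproduced with only the standard library; the sketch below (a hypothetical `Wrapper` type plus a hand-rolled no-op waker) shows why `as_mut().poll(cx)` on the pinned inner future is all the forwarding that is needed:

```rust
use std::future::Future;
use std::pin::Pin;
use std::task::{Context, Poll, RawWaker, RawWakerVTable, Waker};

// Hypothetical wrapper mirroring the RobotsTxtResponse pattern: it owns a
// pinned, boxed inner future and transforms the ready value inside `poll`.
struct Wrapper {
    inner: Pin<Box<dyn Future<Output = u16>>>,
}

impl Future for Wrapper {
    type Output = String;

    fn poll(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Self::Output> {
        // Wrapper is Unpin (it only holds a Pin<Box<...>>), so get_mut is safe.
        let this = self.get_mut();
        match this.inner.as_mut().poll(cx) {
            Poll::Ready(status) => Poll::Ready(format!("status={}", status)),
            Poll::Pending => Poll::Pending,
        }
    }
}

// Hand-rolled no-op waker so the future can be driven without an executor.
fn noop_waker() -> Waker {
    fn clone(_: *const ()) -> RawWaker {
        RawWaker::new(std::ptr::null(), &VTABLE)
    }
    fn noop(_: *const ()) {}
    static VTABLE: RawWakerVTable = RawWakerVTable::new(clone, noop, noop, noop);
    unsafe { Waker::from_raw(RawWaker::new(std::ptr::null(), &VTABLE)) }
}

// Busy-polls until the future resolves; adequate here only because the
// inner future in the usage example is immediately ready.
fn drive(mut f: Wrapper) -> String {
    let waker = noop_waker();
    let mut cx = Context::from_waker(&waker);
    loop {
        if let Poll::Ready(v) = Pin::new(&mut f).poll(&mut cx) {
            return v;
        }
    }
}
```

Driving `Wrapper { inner: Box::pin(async { 200u16 }) }` through `drive` yields `"status=200"`; in the real crate, an executor such as the `tokio` dev-dependency would do the polling.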
23 changes: 23 additions & 0 deletions src/http/reqwest/sync_reqwest.rs
@@ -0,0 +1,23 @@
use reqwest::blocking::{Client, Request};
use reqwest::{Method, Error};
use reqwest::header::HeaderValue;
use url::{Origin, Url};
use reqwest::header::USER_AGENT;
use crate::http::{RobotsTxtClient, DEFAULT_USER_AGENT};
use crate::parser::{ParseResult, parse_fetched_robots_txt};
use crate::model::FetchedRobotsTxt;

impl RobotsTxtClient for Client {
type Result = Result<ParseResult<FetchedRobotsTxt>, Error>;
fn fetch_robots_txt(&self, origin: Origin) -> Self::Result {
let url = format!("{}/robots.txt", origin.unicode_serialization());
let url = Url::parse(&url).expect("Unable to parse robots.txt url");
let mut request = Request::new(Method::GET, url);
let _ = request.headers_mut().insert(USER_AGENT, HeaderValue::from_static(DEFAULT_USER_AGENT));
let response = self.execute(request)?;
let status_code = response.status().as_u16();
let text = response.text()?;
let robots_txt = parse_fetched_robots_txt(origin, status_code, &text);
return Ok(robots_txt);
}
}
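The blocking path above ends by handing the status code and body to `parse_fetched_robots_txt`. As a rough illustration of the kind of rule matching such a parser enables, here is a deliberately simplified, hypothetical `can_fetch` (prefix matching only; no wildcards, no `Allow` precedence, and not the crate's actual algorithm):

```rust
// Simplified sketch: does `path` pass the Disallow rules that apply to
// `user_agent`? Rules are plain prefixes; comments after '#' are stripped.
fn can_fetch(robots_txt: &str, user_agent: &str, path: &str) -> bool {
    let mut applies = false;
    for line in robots_txt.lines() {
        let line = line.split('#').next().unwrap_or("").trim();
        if let Some(ua) = line.strip_prefix("User-agent:") {
            let ua = ua.trim();
            // A group applies to us if it names us or uses the wildcard agent.
            applies = ua == "*" || ua.eq_ignore_ascii_case(user_agent);
        } else if applies {
            if let Some(rule) = line.strip_prefix("Disallow:") {
                let rule = rule.trim();
                // An empty Disallow value means "allow everything".
                if !rule.is_empty() && path.starts_with(rule) {
                    return false;
                }
            }
        }
    }
    true
}
```

With `"User-agent: *\nDisallow: /private"`, the path `/private/data` is rejected and `/public` is allowed.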