Simple api #39
I'll start by saying I was in the exact same boat 4 years ago, except I was insane enough to hack together this entire library just to scrape some websites in Rust using xpath expressions. As such, I wrote only what I needed, with no regard for the official xpath or html specifications. Later I decided other people might find it handy, so I open sourced it. A few people eventually found this library and started asking for other xpath features, but the library wasn't written in a way that allowed those features to be added easily, so I rewrote the entire xpath module about 6 months ago, this time following the specification as closely as I could. This complete rewrite of the xpath module is why the two modules feel so different. I may attempt to rewrite the html module to bring them in line soon, but it's a big change that will take time.
No chance I'm adding http requests to this library, simply because there are too many existing http crates in the Rust ecosystem to make a choice for you. Especially since some are async and some are not. This library is fairly light on dependencies and I'd like to keep it that way.
Seems reasonable to add. Every other example you gave is already possible, but maybe a bit verbose, so I'll look into adding more concise functions.
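For reference, the "verbose" path today looks roughly like this. This is a sketch assuming the v0.7 `html::parse` / `xpath::parse` / `XpathItemTree` API; exact type and method names may differ from the published docs:

```rust
use skyscraper::html;
use skyscraper::xpath::{self, XpathItemTree};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Parse an HTML string into a document.
    let document = html::parse("<html><body><a href=\"/x\">link</a></body></html>")?;

    // Build the item tree that the xpath engine evaluates against.
    let tree = XpathItemTree::from(&document);

    // Compile the expression, then apply it to the tree.
    let expression = xpath::parse("//a")?;
    let items = expression.apply(&tree)?;

    println!("matched {} item(s)", items.len());
    Ok(())
}
```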
What you have done for the community is worthy of admiration. A huge thank you. A simple API is just light, understandable syntactic sugar, a functional wrapper that anyone can pick up and immediately get a result with, even without knowing Rust. Similarity with the antchfx/htmlquery API would make the transition from the Go community to Rust easy: straightforward retrieval of the document for processing, and a clear way to obtain the result that even a grandmother can understand.

As for the HTTP client, the discussion here is not so much about dependencies as about the completeness of the tool itself. Essentially, all that's needed is a simple GET request to the site with a timeout if the site does not respond; the expect() would cover either an expired timeout or a server error code. For anything beyond that, the crate user configures a client themselves and passes it to the function. This alone would raise the crate to a new level. In antchfx/htmlquery, you had to write the client yourself because there was no way to change timeouts and so on. After all, the main task of the crate is to parse HTML pages that live not in a file and not in code but on the internet, yet the crate lacks the main tool for getting a page from the internet. You will see for yourself how adding it leads to even greater popularity and use by the community.
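Concretely, the fetching half is just something like the following sketch, using reqwest with its `blocking` feature; the 30-second timeout is only an illustration:

```rust
use std::time::Duration;

/// Fetch a page body with a caller-controlled timeout.
/// Requires reqwest with the "blocking" feature enabled.
fn fetch_html(url: &str) -> Result<String, reqwest::Error> {
    let client = reqwest::blocking::Client::builder()
        .timeout(Duration::from_secs(30)) // fail instead of hanging forever
        .build()?;
    client
        .get(url)
        .send()?
        .error_for_status()? // surface server error codes as errors
        .text()
}
```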
You may be right about different clients; in that case the documentation needs to show how to create a document for processing, plus a full example of querying and getting the result. For example:
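A minimal sketch of creating the document, assuming skyscraper's `html::parse` entry point:

```rust
use skyscraper::html;

// Parse raw HTML text into a queryable document (assumed v0.7 API).
let html_text = "<html><body><div>hello</div></body></html>";
let document = html::parse(html_text).expect("failed to parse HTML");
```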
And a full example:
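(A sketch under the same assumptions: reqwest's blocking API for the GET, and the assumed v0.7 skyscraper names for parsing and querying.)

```rust
use skyscraper::html;
use skyscraper::xpath::{self, XpathItemTree};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Fetch the page (reqwest with the "blocking" feature).
    let body = reqwest::blocking::get("http://example.com/")?.text()?;

    // Parse the HTML and evaluate an xpath expression against it.
    let document = html::parse(&body)?;
    let tree = XpathItemTree::from(&document);
    let links = xpath::parse("//a/@href")?.apply(&tree)?;

    // Print whatever the engine matched; item types are engine-specific.
    for link in links {
        println!("{:?}", link);
    }
    Ok(())
}
```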
v0.7.0-beta.1 has addressed some of these feature requests. See #42 for details.
I have some experience with parsing using xpath, and I was very disappointed that there isn't a proper crate for parsing websites in Rust. Previously, I used what is practically the only decent library in Go, antchfx/htmlquery (used by 11,214 repositories), if you don't count go-colly.
I would simply suggest a similar syntax to htmlquery because names like XpathItemTree look intimidating.
In the readme.md file:
- How to install the crate: `cargo add skyscraper`
- Dependencies for the cargo.toml file
- And functions with clear names:
```rust
use skyscraper::html::{Query, Find, FindOne, SelectAttr, SelectOneAttr};
```
Find and SelectAttr return a Vec of values. FindOne and SelectOneAttr return a single &str value.
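None of these wrappers exist in skyscraper today; purely as a hypothetical sketch, their signatures could look like the following (the document type `HtmlDocument` is assumed from the existing crate, and owned Strings stand in for &str to keep lifetimes simple):

```rust
// Hypothetical convenience layer: nothing below exists in skyscraper yet.
use skyscraper::html::HtmlDocument; // assumed existing document type

pub struct Query;

impl Query {
    /// Proposal: load an HTML document from a URL with a default 30s timeout.
    pub fn url(_url: &str) -> Result<HtmlDocument, Box<dyn std::error::Error>> {
        unimplemented!("proposal only")
    }

    /// Proposal: same, but with a caller-configured HTTP client.
    pub fn url_client(
        _url: &str,
        _client: &reqwest::blocking::Client,
    ) -> Result<HtmlDocument, Box<dyn std::error::Error>> {
        unimplemented!("proposal only")
    }
}

/// Proposal: text of every node matching the xpath.
pub fn find(_doc: &HtmlDocument, _xpath: &str) -> Result<Vec<String>, Box<dyn std::error::Error>> {
    unimplemented!("proposal only")
}

/// Proposal: text of the first node matching the xpath.
pub fn find_one(_doc: &HtmlDocument, _xpath: &str) -> Result<String, Box<dyn std::error::Error>> {
    unimplemented!("proposal only")
}
```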
As well as a similar API with very simple, understandable examples.

From URL:

Load an HTML document from a URL, with a default timeout of 30s:

```rust
let doc = Query::url("http://example.com/").expect("request failed or timed out");
```

Load an HTML document from a URL with custom client settings:

```rust
let doc = Query::url_client("http://example.com/", &client).expect("request failed");
```

And the same again for loading from a file and from text.
Also, add Find and FindOne functions:

```rust
// Find all A elements.
let list = Find(&doc, "//a").expect("query failed");

// Find all A elements that have an href attribute.
let list = Find(&doc, "//a[@href]").expect("query failed");

// Find all A elements with an href attribute and return only the links.
let list = Find(&doc, "//a/@href").expect("query failed");

// Find the first A element.
let a = FindOne(&doc, "//a[1]").expect("query failed");

// Find the third A element.
let a = FindOne(&doc, "//a[3]").expect("query failed");
```
Select-attribute functions are possible but unnecessary, as you can already retrieve an attribute of an element using XPath. The documentation should simply include an example of how to do this for those who have forgotten XPath:

```
//a/@href
//div/@inner_parameter
```

That covers selecting all attributes and selecting one attribute; a way to get a count of elements would round out the set.
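For illustration, a hedged sketch of such a documentation example, using the same assumed v0.7 API as earlier; the count of elements falls out of the result set's length:

```rust
use skyscraper::html;
use skyscraper::xpath::{self, XpathItemTree};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let document = html::parse(
        r#"<html><body><a href="/a">one</a><a href="/b">two</a></body></html>"#,
    )?;
    let tree = XpathItemTree::from(&document);

    // Select attribute nodes directly; no dedicated SelectAttr helper is needed.
    let hrefs = xpath::parse("//a/@href")?.apply(&tree)?;

    // "Get count of elements" is just the size of the result set.
    println!("{} href attribute(s)", hrefs.len());
    Ok(())
}
```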
But this is just a subjective example of an API that looks simple and understandable.