
Simple api #39

Open
RustGrow opened this issue Jun 18, 2024 · 4 comments

RustGrow commented Jun 18, 2024

I have some experience with parsing using XPath, and I was very disappointed that there isn't a proper crate for parsing websites in Rust. Previously, I used what is practically the only decent library in Golang, antchfx/htmlquery (used by 11,214 repositories on GitHub), if you don't count go-colly.

I would simply suggest a similar syntax to htmlquery because names like XpathItemTree look intimidating.

In the README.md file:

How to install the crate:

cargo add skyscraper

Dependencies in the Cargo.toml file:

[dependencies]
serde = { version = "1.0", features = ["derive"] }
serde_json = "1.0"
skyscraper = "0.6.4"
reqwest = { version = "0.12.4", features = ["default", "blocking", "cookies", "json", "socks"] }
tokio = { version = "1", features = ["full"] }

And functions with clear names:

use skyscraper::html::{Query, Find, FindOne, SelectAttr, SelectOneAttr};

Find and SelectAttr return a Vec of values.
FindOne and SelectOneAttr return a single &str value.
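
A minimal sketch of what those helpers' signatures might look like, purely hypothetical: the Doc and Error types below are placeholders, and none of these functions exist in skyscraper today.

    // Hypothetical signatures for the proposed convenience API.
    // `Doc` and `Error` are placeholders, not real skyscraper types.
    pub struct Doc;
    pub struct Error;

    pub fn find(doc: &Doc, xpath: &str) -> Result<Vec<String>, Error> {
        todo!("collect the text of every node in `doc` matching `xpath`")
    }

    pub fn find_one<'a>(doc: &'a Doc, xpath: &str) -> Result<&'a str, Error> {
        todo!("return the text of the first node matching `xpath`")
    }

    pub fn select_attr(doc: &Doc, xpath: &str, attr: &str) -> Result<Vec<String>, Error> {
        todo!("return the value of `attr` for every element matching `xpath`")
    }

    pub fn select_one_attr<'a>(doc: &'a Doc, xpath: &str, attr: &str) -> Result<&'a str, Error> {
        todo!("return the value of `attr` for the first element matching `xpath`")
    }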

As well as a similar API with very simple, understandable examples:

From URL

  1. Load an HTML document from a URL. Default timeout is 30s.
    let doc = Query::url("http://example.com/").expect("");

  2. Load an HTML document from a URL with client settings.
    let doc = Query::url_client("http://example.com/", &client).expect("");

From file

let file_path = "/home/user/sample.html";
let doc = Query::file(file_path).expect("");

From text

let text = r#"<html>....</html>"#;
let doc = Query::text(text).expect("");

Also, add Find and FindOne functions:
Find all A elements.
let list = Find(&doc, "//a").expect("");

Find all A elements that have an href attribute.
let list = Find(&doc, "//a[@href]").expect("");

Find all A elements that have an href attribute, and return only the link values.
let list = Find(&doc, "//a/@href").expect("");

Find the first A element.
let a = FindOne(&doc, "//a[1]").expect("");

Find the third A element.
let a = FindOne(&doc, "//a[3]").expect("");

Dedicated attribute-selection functions are possible but unnecessary, as you can retrieve an element's attribute directly with XPath. The documentation should simply include an example of how to do this for those who have forgotten XPath:
//a/@href
//div/@inner_parameter

--
Select an attribute from all matching elements:

let attr = SelectAttr(&doc, "//img", "src").expect("");

Select an attribute from one element:

let attr = SelectOneAttr(&doc, "//img[1]", "src").expect("");

--
Get the count of elements:

let list = Find(&doc, "//a").expect("");
let count = list.len();

But this is just a subjective example of an API that looks simple and understandable.

James-LG (Owner) commented Jun 19, 2024

> I have some experience with parsing using XPath, and I was very disappointed that there isn't a proper crate for parsing websites in Rust.

I'll start by saying I was in the exact same boat 4 years ago, except I was insane enough to hack together this entire library just to scrape some websites in Rust using xpath expressions. As such I wrote only what I needed, and I had no regard for the official xpath or html specification.

Later I decided some other people might find it handy, so I open sourced it. A few people eventually found this library and started asking for other xpath features, but the library wasn't written in a way that allowed those features to be added easily, so I rewrote the entire xpath module about 6 months ago, this time following the specification as closely as I could.

This complete rewrite of the xpath module is why the XpathItemTree exists at the moment. The tree follows the xpath specification, but I didn't have time to rewrite the html module, so the tree became a bridge from the old HtmlDocument to the new XpathItemTree. Ideally the html module would directly return what is now called the XpathItemTree but could be called HtmlDocument.
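
Concretely, that bridge is the extra conversion step callers perform today before querying; a bare sketch using the types named above:

    // `html::parse` still produces the old HtmlDocument...
    let doc = html::parse("<html><body></body></html>").expect("valid HTML");
    // ...and `XpathItemTree::from` bridges it into the new spec-compliant
    // tree that xpath expressions are evaluated against.
    let tree = XpathItemTree::from(&doc);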

I may attempt to rewrite the html module to bring them in line soon, but it's a big change that will take time.

> Load an HTML document from a URL. Default timeout is 30s.

No chance I'm adding http requests to this library, simply because there are too many existing http crates in the Rust ecosystem to make a choice for you. Especially since some are async and some are not. This library is fairly light on dependencies and I'd like to keep it that way.

> From file

Seems reasonable to add.


Every other example you gave is already possible, but maybe a bit verbose, so I'll look into adding more concise functions.
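
For context, here is a sketch of that "possible, but a bit verbose" path with the current API; the calls are taken from the full example later in this thread, with a hard-coded document standing in for a real page:

    use skyscraper::html;
    use skyscraper::xpath::{self, XpathItemTree};
    use std::error::Error;

    fn main() -> Result<(), Box<dyn Error>> {
        // Parse a document and bridge it into the queryable tree.
        let doc = html::parse(r#"<html><body><a href="/one">one</a></body></html>"#)?;
        let tree = XpathItemTree::from(&doc);

        // Evaluate a plain XPath attribute query.
        let items = xpath::parse("//a/@href")
            .expect("xpath is invalid")
            .apply(&tree)?;

        for item in items.iter() {
            // Attribute nodes expose their value directly.
            println!("{}", item.extract_as_node().extract_as_attribute_node().value);
        }
        Ok(())
    }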

RustGrow (Author) commented Jun 19, 2024

What you have done for the community is worthy of admiration, and a huge thank you to you. A simple API is just light, understandable syntactic sugar: a functional wrapper that any schoolchild could pick up and immediately get a result from, even without knowing Rust. Similarity with the antchfx/htmlquery API would make the transition from the Go community to Rust easy: simple retrieval of the document for processing and clear access to the result, understandable even to a grandmother.

As for the HTTP client, the discussion here is not so much about dependencies as about the completeness of the tool itself. Essentially, all that is needed is a simple GET request to the site, with a timeout if the site does not respond; the expect() would surface either an expired timeout or a server error code. For the more complex case, the crate user configures a client themselves and passes it to the function. This alone would raise the crate to a new level. In antchfx/htmlquery, you had to write the client yourself because there was no way to change timeouts, etc. After all, the main task of the crate is to parse HTML pages that live neither in a file nor in the code but on the internet, and yet the main tool for getting the page from the internet is missing. You will see for yourself how adding this core tool leads to even greater popularity and use by the community.

RustGrow (Author) commented

You may be right about the different clients; in that case, the documentation needs to explain how to create a document for processing, along with a full example of running a query and getting the result. For example:

  1. Load an HTML document from a URL. Default timeout is 30s.

    // Requires `use std::time::Duration;` (or `tokio::time::Duration`).
    let client = reqwest::Client::builder()
        .timeout(Duration::from_secs(30)) // Set a 30-second timeout
        .build()?;

    let res = client
        .get("https://finance.yahoo.com/?guccounter=1")
        .send()
        .await?;

    // Parse the HTML text
    let doc = html::parse(&res.text().await?)?;

And a full example:

[dependencies]
serde = { version = "1.0", features = ["derive"] }
serde_json = "1.0"
skyscraper = "0.7.0-beta.0"
reqwest = { version = "0.12.4", features = ["default", "blocking", "cookies", "json", "socks"] }
tokio = { version = "1", features = ["full"] }

use reqwest;
use serde::{Deserialize, Serialize};
use skyscraper::html;
use skyscraper::xpath::{self, XpathItemTree};
use std::error::Error;
use std::fs::File;
use std::io::prelude::*;
use tokio::time::Duration;

#[derive(Serialize, Deserialize, Debug)]
struct TestXpath<'a> {
    result: &'a str,
}
#[tokio::main]
async fn main() -> Result<(), Box<dyn Error>> {
    let mut queries: Vec<TestXpath> = vec![];

    let client = reqwest::Client::builder()
        .timeout(Duration::from_secs(30)) // Set a 30-second timeout
        .build()?;

    let res = client
        .get("https://finance.yahoo.com/?guccounter=1")
        .send()
        .await?;

    // Parse the HTML text
    let doc = html::parse(&res.text().await?)?;
    let xpath_item_tree = XpathItemTree::from(&doc);

    // Assuming your XPath string is static, it is safe to use `expect` during parsing
    let test_xpath = xpath::parse("//ul[@class=\"story-items svelte-6i0owd\"]//a/@href")
        .expect("xpath is invalid")
        .apply(&xpath_item_tree)?;

    for item in test_xpath.iter() {
        let res = TestXpath {
            result: &item.extract_as_node().extract_as_attribute_node().value,
        };
        queries.push(res);
    }

    // Serialize it to a JSON string.
    let test_query = serde_json::to_string(&queries)?;

    let mut file = File::create("output.json")?;
    file.write_all(test_query.as_bytes())?;

    Ok(())
}

James-LG (Owner) commented

v0.7.0-beta.1 has addressed some of these feature requests. See #42 for details.
