Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow for sel to select and wrap multiple element results #10

Merged
merged 5 commits into from
Sep 27, 2024

Conversation

Isaac-Flath
Copy link
Contributor

@Isaac-Flath Isaac-Flath commented Sep 26, 2024

Fix bug in read_html.

I added tests to ensure it is grabbing something when various selectors are passed.

@Isaac-Flath Isaac-Flath requested a review from jph00 September 26, 2024 00:29
@jph00
Copy link
Collaborator

jph00 commented Sep 26, 2024

Hmm interesting. I see why you're doing it this way, although I might need a tiny bit of convincing that just concatenating the strings is the right approach...

@Isaac-Flath
Copy link
Contributor Author

Isaac-Flath commented Sep 26, 2024

The most important thing I want to fix is a bug that prevents a function from working (see first screenshot). There are two options for fixing it after more research.

  • select_one : page = str(soup.select_one(sel)): This gets the first thing that matches the CSS selector if there are multiple on the page. This mirrors the current code more closely.
  • select : page ''.join(str(el) for el in soup.select(sel)) : This will be more conservative, and if a CSS selector appears multiple times on the page, it will get them all. I think there are 2 good reasons to use this:
    • If you don't realize there's a second (or more) match to the CSS selector in the URL, this error is on the side of bringing in what you need + more. Selecting one could err on the side of not bringing in what you need because it only grabbed the first one.
    • It allows for nifty stuff that I am not 100% sure will be useful (but I think it has potential). See the second screenshot where I use select to grab all the .listing-descriptions from our quarto blog. If you use select_one is just gives the first one instead of all of the .listing-descriptionss for each blog post.

Bug Screenshot

Screenshot 2024-09-26 at 6 09 11 AM

Select vs Select_one

Screenshot 2024-09-26 at 6 16 31 AM

@Isaac-Flath
Copy link
Contributor Author

We could also do both, and have a multi=False arg, so it defaults to the current intended behavior but allows for the multiple selection option also. Maybe that's best?

@jph00
Copy link
Collaborator

jph00 commented Sep 26, 2024

That's a good idea. And maybe use \n as the join str?

@jph00
Copy link
Collaborator

jph00 commented Sep 26, 2024

And how about also having an optional wrap_tag, e.g. if set to document each returned element is wrapped with <document>...</document>?

@Isaac-Flath
Copy link
Contributor Author

I've implemented both of them. I am not super happy with the code, but I am not sure it can be improved significantly since the wrap_arg injection needs to happen after the HTML to markdown conversion. If you have ways this can be done cleaner, that'd be helpful.

@jph00
Copy link
Collaborator

jph00 commented Sep 27, 2024

I'm not super happy with the code either, but I also can't see a better way to do it offhand. So I'll just merge it now, but if you come up with something better later, feel free to put in another PR, of course! :)

@jph00 jph00 merged commit 09a36fe into AnswerDotAI:main Sep 27, 2024
@jph00 jph00 added the enhancement New feature or request label Sep 27, 2024
@jph00 jph00 changed the title Bug fix read_html css selection Allow for sel to select and wrap multiple element results Sep 27, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants