bugfix: webCrawl can't handle website well recursively #814
base: main
Conversation
When using Cheerio Web Scraper and selecting Web Crawl, it is supposed to crawl web pages recursively. There were a couple of bugs preventing that from working:

1. URLs extracted from the page include things like default:blank and mailto: links
2. The extraction rules are not well aligned with the specification of the HTML `<a>` element
3. It always returns 10 pages of the website

Furthermore, this commit also adds a few informative logs when debug is enabled.

Signed-off-by: Ben Gao <bengao168@msn.com>
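The first two bugs concern which hrefs get extracted. As a minimal sketch of that kind of filtering (not the PR's actual diff; `normalizeLink` is a hypothetical helper), non-HTTP schemes can be skipped and relative hrefs resolved with the standard `URL` API:

```typescript
// Hypothetical helper illustrating the filtering the PR description calls for:
// skip non-HTTP schemes (mailto:, tel:, javascript:, about:) and resolve
// relative hrefs against the page URL, per the HTML <a> element spec.
function normalizeLink(href: string, baseUrl: string): string | null {
  const trimmed = href.trim();
  if (trimmed === "" || trimmed.startsWith("#")) return null; // same-page anchors
  try {
    const url = new URL(trimmed, baseUrl); // resolves relative hrefs
    if (url.protocol !== "http:" && url.protocol !== "https:") return null; // mailto:, tel:, etc.
    url.hash = ""; // drop fragments so the same page is not crawled twice
    return url.href;
  } catch {
    return null; // malformed href
  }
}
```

For example, `normalizeLink("mailto:a@b.com", "https://example.com/")` yields `null`, while a relative href like `"/docs"` resolves against the base URL.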
Hi @gaord, I got a lot of additional weird links from this PR.
Results:
Hi, well tested! Thanks.
Hi @gaord, my testing:

Conclusions: What I can agree with you on is adding the code below; this will help improve scraping performance.

Logs:

Other functionality: cc @HenryHengZJ
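The snippet referenced in the comment was not captured in this page. As a hedged illustration of the kind of change bug 3 in the PR description calls for (replacing the hard-coded cap of 10 pages), a breadth-first crawl with a visited set and a caller-supplied limit could be sketched as follows (`crawl` and `fetchLinks` are hypothetical names, not Flowise APIs):

```typescript
// Hedged sketch, not the PR's diff: breadth-first crawl with deduplication
// and an explicit page limit instead of a hard-coded cap of 10 pages.
async function crawl(
  startUrl: string,
  fetchLinks: (url: string) => Promise<string[]>, // returns already-normalized links
  limit: number,
): Promise<string[]> {
  const visited = new Set<string>([startUrl]);
  const queue: string[] = [startUrl];
  const pages: string[] = [];
  while (queue.length > 0 && pages.length < limit) {
    const url = queue.shift()!;
    pages.push(url);
    for (const link of await fetchLinks(url)) {
      if (!visited.has(link)) {
        visited.add(link); // mark before enqueueing so a page is queued once
        queue.push(link);
      }
    }
  }
  return pages;
}
```

The visited set is what keeps a recursive crawl from looping on pages that link back to each other.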
Hi there, could you try https://docs.ceph.com/en/quincy/ please? Let me know what you find with the new code.
Original Code: failed href: because it does not cater to

This PR:
Hey @gaord, thank you so much for the solution! We are going to put this on hold first, since we are going to revamp the URL scraping UI to allow users to see what links are scraped, and to stop whenever they want. Otherwise this could go on for a long period of time and leave users in the dark as to what has been scraped.
No problem. Doing things right comes first, since it saves time. The original code doesn't process web pages the right way, so the PR is crafted in case it is useful for others. Thanks for your great work making AI apps easy, anyway.
Using the latest config file 4.6.8 and choosing gpt-4-turbo for the chat, I get an error: null max_tokens is too large: 62500. This model supports at most 4096 completion tokens, whereas you provided 62500. (request id: 20240202110253407344738SmDnkwX1) The cause is that the official gpt-4-turbo returns at most 4096 completion tokens.
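Until the configuration validates this for the user, a hedged sketch of a client-side guard (hypothetical helper, not Flowise or OpenAI SDK code) that clamps the configured max_tokens to a known completion-token limit might look like:

```typescript
// Hypothetical guard: clamp the configured max_tokens to the model's
// completion-token limit so requests like the one above do not fail.
// The 4096 figure comes from the error message quoted in the comment.
const COMPLETION_TOKEN_LIMITS: Record<string, number> = {
  "gpt-4-turbo": 4096,
};

function clampMaxTokens(model: string, requested: number): number {
  const limit = COMPLETION_TOKEN_LIMITS[model];
  return limit !== undefined ? Math.min(requested, limit) : requested;
}
```

With this guard, a configured value of 62500 for gpt-4-turbo would be sent as 4096 instead of triggering the API error.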