Improve HTML Splitter, URL Fetching Logic & Astro Docs Ingestion #293
## Description

### Closes the Following Issues

### Partially Completes Issues

## Technical Details
- Added `process_astro_doc_page_content`: strips noisy, unhelpful content such as the nav bar, footer, and header, and extracts only the main page article content.
- Removed `scrape_page` (which scraped the HTML content AND recursively scraped all of its sub-pages via the links it contained). This was removed because: 1. there is already a centralized util function, `fetch_page_content()`, that fetches each page's HTML elements; 2. there is already a centralized util function, `get_internal_links`, that finds all links in a page; 3. the scraping process itself did not exclude noisy, unrelated content, which is now handled by `process_astro_doc_page_content` above.
- Reworked `split_html`: it previously split on specific HTML tags using `HTMLHeaderTextSplitter`, which is not ideal because we do not want to split that often and there is no guarantee that splitting on those tags preserves semantic meaning. It now uses `RecursiveCharacterTextSplitter` with a token limit: a chunk is ONLY split once it exceeds the specified token count, and if a piece still exceeds the limit the splitter moves down the separator list, ultimately splitting on spaces and individual characters to fit the limit. This preserves more semantic meaning in each chunk while still enforcing the token limit.
- Updated `fetch_page_content` to auto-retry with exponential backoff using tenacity.
- Updated `get_page_links` to traverse a given page recursively and find all links belonging to the website, ensuring no page is visited twice and none is missed. Previously some links were missed during traversal, likely because the logic used a single for loop rather than recursing until all links were exhausted.
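As a rough illustration of what `process_astro_doc_page_content` does (the real implementation may differ; the `<article>` container and the exact tag set here are assumptions), a stdlib-only sketch that keeps the article text and drops nav/footer/header chrome:

```python
from html.parser import HTMLParser

NOISY_TAGS = {"nav", "footer", "header", "script", "style"}


class ArticleExtractor(HTMLParser):
    """Collect text inside <article>, skipping noisy chrome tags."""

    def __init__(self):
        super().__init__()
        self.in_article = 0   # depth inside <article> elements
        self.skip_depth = 0   # depth inside noisy elements
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in NOISY_TAGS:
            self.skip_depth += 1
        elif tag == "article":
            self.in_article += 1

    def handle_endtag(self, tag):
        if tag in NOISY_TAGS and self.skip_depth:
            self.skip_depth -= 1
        elif tag == "article" and self.in_article:
            self.in_article -= 1

    def handle_data(self, data):
        # Keep text only when inside the article and outside noisy tags.
        if self.in_article and not self.skip_depth:
            text = data.strip()
            if text:
                self.chunks.append(text)


def process_astro_doc_page_content(html: str) -> str:
    parser = ArticleExtractor()
    parser.feed(html)
    return "\n".join(parser.chunks)
```

A real docs page would be fetched first and then passed through this extractor before splitting.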
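The splitting strategy can be sketched without LangChain (the PR uses `RecursiveCharacterTextSplitter`; `count_tokens` below is a crude word-count stand-in for a real tokenizer, and the greedy re-merge is a simplification of LangChain's chunk merging):

```python
def count_tokens(text: str) -> int:
    # Crude stand-in for a real tokenizer: one token per whitespace word.
    return len(text.split())


def split_text(text: str, max_tokens: int,
               separators: tuple = ("\n\n", "\n", " ")) -> list:
    """Split only when a chunk exceeds max_tokens, trying coarse
    separators first and recursing with finer ones as needed."""
    if count_tokens(text) <= max_tokens:
        return [text]                      # under the limit: keep whole
    if not separators:
        # Last resort: hard split by words to fit the limit.
        words = text.split()
        return [" ".join(words[i:i + max_tokens])
                for i in range(0, len(words), max_tokens)]
    sep, rest = separators[0], separators[1:]
    pieces = [p for p in text.split(sep) if p.strip()]
    if len(pieces) <= 1:                   # separator absent: go finer
        return split_text(text, max_tokens, rest)
    atoms = []
    for piece in pieces:                   # recursively split each piece
        atoms.extend(split_text(piece, max_tokens, rest))
    # Greedily re-merge neighbours so chunks stay near the limit.
    chunks, current = [], ""
    for atom in atoms:
        candidate = current + sep + atom if current else atom
        if count_tokens(candidate) <= max_tokens:
            current = candidate
        else:
            chunks.append(current)
            current = atom
    if current:
        chunks.append(current)
    return chunks
```

Short paragraphs stay whole (or merge with neighbours), while an oversized run of text falls through to finer and finer separators, which is the behaviour described above.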
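In the actual change the retry behaviour comes from tenacity (its `retry` decorator with `wait_exponential`); a dependency-free sketch of the same retry-with-exponential-backoff idea, which a function like `fetch_page_content` could be wrapped with:

```python
import functools
import time


def retry_with_backoff(max_attempts=5, base_delay=1.0,
                       exceptions=(Exception,)):
    """Decorator: retry the wrapped call on failure, sleeping
    base_delay * 2**attempt seconds between tries."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(max_attempts):
                try:
                    return func(*args, **kwargs)
                except exceptions:
                    if attempt == max_attempts - 1:
                        raise              # out of attempts: re-raise
                    time.sleep(base_delay * 2 ** attempt)
        return wrapper
    return decorator
```

Usage would look like `@retry_with_backoff(max_attempts=3, base_delay=0.5, exceptions=(ConnectionError, TimeoutError))` on the fetching function, so transient network failures back off at 0.5s, 1s, 2s before giving up.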
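The exhaustive traversal with de-duplication can be sketched as a breadth-first crawl over the existing `get_internal_links` helper (passed in as a callable here so the sketch is testable without network access; the real signature is an assumption):

```python
from collections import deque


def get_page_links(start_url, get_internal_links):
    """Traverse the site from start_url, following internal links
    until none are left. A visited set prevents duplicate visits
    and cycles, so every reachable page is seen exactly once."""
    visited = set()
    queue = deque([start_url])
    while queue:
        url = queue.popleft()
        if url in visited:
            continue              # already crawled this page
        visited.add(url)
        for link in get_internal_links(url):
            if link not in visited:
                queue.append(link)
    return visited
```

Because the queue is drained until empty, links discovered on sub-pages are followed too, which is what the single-pass for loop missed.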
## Results

### Astro Docs: Better URLs Fetched + Crawling Improvement + HTML Splitter Improvement
Old links: astro_docs_links_old.txt
New links: new_astro_docs_links.txt
## Evaluation
evaluation_result_improve_1.csv