Commit
Improve HTML Splitter, URL Fetching Logic & Astro Docs Ingestion (#293)
### Description

- This is the first part of a two-part effort to improve the scraping, extraction, chunking, and tokenizing logic for Ask Astro's data ingestion process (see details in issue #258).
- This PR mainly focuses on reducing noise in the ingestion of the Astro Docs data source, along with some related changes, such as scraping only the latest doc versions and adding automatic exponential backoff to the HTML get function.

### Closes the Following Issues

- #292
- #270
- #209

### Partially Completes Issues

- #258 (two-part effort, only one PR completed)
- #221 (tackles the token limit in the HTML splitting logic; other parts still need tackling)

### Technical Details

- airflow/include/tasks/extract/astro_docs.py
  - Add function `process_astro_doc_page_content`, which strips noisy, unhelpful content such as the nav bar, footer, and header, and extracts only the main page article content.
  - Remove the previous function `scrape_page` (which scraped the HTML content and found all of its sub pages using the links it contained). This is done because:
    1. there is already a centralized util function, `fetch_page_content()`, that fetches each page's HTML elements;
    2. there is already a centralized util function, `get_internal_links`, that finds all links in a page;
    3. the scraping process itself did not exclude noisy, unrelated content; that is now handled by `process_astro_doc_page_content` from the previous bullet point.
- airflow/include/tasks/split.py
  - Modify function `split_html`: it previously split on specific HTML tags using `HTMLHeaderTextSplitter`, which is not ideal because we do not want to split that often and there is no guarantee that splitting on such tags retains semantic meaning. It now uses `RecursiveCharacterTextSplitter` with a token limit, which splits ONLY when a chunk exceeds the specified number of tokens. If a chunk still exceeds the limit, the splitter goes down the separator list and splits further, eventually splitting by space and by character, until the chunk fits within the token limit. This retains better semantic meaning in each chunk and enforces the token limit.
- airflow/include/tasks/extract/utils/html_utils.py
  - Change function `fetch_page_content` to add automatic retry with exponential backoff using tenacity.
  - Change function `get_page_links` to traverse a given page recursively and find all links related to this website. This ensures that no page is traversed twice and no page is missed. Previously, the logic missed some links, potentially because it used a for loop rather than traversing recursively until all links were exhausted.
    - Note: this changes the fetched URL set substantially. Previously many links looked like https://abc.com/abc#XXX and https://abc.com/abc#YYY, where the part after the `#` points to a section of the same page, but the old logic was not able to recognize them as the same page.
- airflow/requirements.txt: add required packages
- api/ask_astro/settings.py: remove unused variables

### Results

#### Astro Docs: Better URLs Fetched + Crawling Improvement + HTML Splitter Improvement

1. Example of formatting and chunking
   - Previously (near unreadable)
     ![image](https://github.com/astronomer/ask-astro/assets/26350341/90ff59f9-1401-4395-8add-cecd8bf08ac4)
   - Now (cleaned!)
     ![image](https://github.com/astronomer/ask-astro/assets/26350341/b465a4fc-497c-4687-b601-aa03ba12fc15)
2. Example of URLs difference
   - Previously
     - Around 1000 links fetched. Many have duplicate content since they are the same link.
     - XMLs and non-HTML/website content fetched
     - See old links: [astro_docs_links_old.txt](https://github.com/astronomer/ask-astro/files/14146665/astro_docs_links_old.txt)
   - Now
     - No more duplicate pages or unreleased pages
     - No older versions for software docs, only the latest docs are ingested (e.g.: the .../0.31... links are gone)
     - [new_astro_docs_links.txt](https://github.com/astronomer/ask-astro/files/14146669/new_astro_docs_links.txt)

#### Evaluation

- Overall improvement in answer and retrieval quality
- No degradation noted
- CSV posted in comments
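
#### Sketch: extracting only the main article content

The Technical Details describe `process_astro_doc_page_content` as dropping the nav bar, header, and footer and keeping the main article content. The actual implementation is not shown on this page; the following is a minimal standard-library sketch of the same idea. The names `MainTextExtractor`, `extract_main_text`, and `NOISE_TAGS` are hypothetical, and the choice of which tags count as noise is an assumption.

```python
from html.parser import HTMLParser

# Assumption: these tags hold page chrome rather than article content.
NOISE_TAGS = {"nav", "header", "footer", "script", "style"}


class MainTextExtractor(HTMLParser):
    """Collects text that is not nested inside any noise tag."""

    def __init__(self):
        super().__init__()
        self.noise_depth = 0  # > 0 while inside at least one noise tag
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in NOISE_TAGS:
            self.noise_depth += 1

    def handle_endtag(self, tag):
        if tag in NOISE_TAGS and self.noise_depth > 0:
            self.noise_depth -= 1

    def handle_data(self, data):
        # Keep non-empty text only when we are outside all noise tags.
        if self.noise_depth == 0 and data.strip():
            self.parts.append(data.strip())


def extract_main_text(html):
    """Return the page text with nav/header/footer/script/style removed."""
    parser = MainTextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)
```

Calling `extract_main_text(page_html)` on a fetched page returns one line of text per retained text node, which can then be passed to the splitter.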
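
#### Sketch: recursive splitting under a token limit

The `split_html` change in Technical Details relies on `RecursiveCharacterTextSplitter` with a token limit. To illustrate the behaviour described there (split only when a chunk exceeds the limit, then fall back to finer separators), here is a simplified standard-library sketch. It is not the library's implementation: `count_tokens` approximates tokens by whitespace-separated words, the separator list is an assumption, and unlike the description above it stops at spaces rather than going down to single characters.

```python
def count_tokens(text):
    # Assumption: whitespace-separated words stand in for real tokens.
    return len(text.split())


def split_recursive(text, max_tokens, separators=("\n\n", "\n", " ")):
    """Split text into chunks of at most max_tokens, preferring coarse separators."""
    # Small enough already (or nothing finer to split on): keep as one chunk.
    if count_tokens(text) <= max_tokens or not separators:
        return [text] if text.strip() else []

    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if count_tokens(candidate) <= max_tokens:
            current = candidate  # keep merging pieces while under the limit
            continue
        if current.strip():
            chunks.append(current)
        current = ""
        if count_tokens(piece) <= max_tokens:
            current = piece
        else:
            # The piece alone is too large: retry with the next, finer separator.
            chunks.extend(split_recursive(piece, max_tokens, finer))
    if current.strip():
        chunks.append(current)
    return chunks
```

Paragraph breaks are tried first, so a chunk is only cut mid-paragraph when a single paragraph exceeds the limit on its own.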
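
#### Sketch: retry with exponential backoff

The PR adds retry with exponential backoff to `fetch_page_content` using tenacity. The exact decorator arguments are not shown on this page, so rather than guess them, this sketch shows the underlying technique with the standard library only. `retry_with_backoff` and its parameters are hypothetical and are not tenacity's API.

```python
import time


def retry_with_backoff(func, max_attempts=5, base_delay=1.0, max_delay=30.0,
                       sleep=time.sleep):
    """Call func(); on failure wait base_delay * 2**n (capped) and try again."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            # Delay doubles after every failed attempt, up to max_delay.
            delay = min(base_delay * 2 ** (attempt - 1), max_delay)
            sleep(delay)
```

A fetch could be wrapped as `retry_with_backoff(lambda: fetch(url))`, where `fetch` is whatever function performs the HTTP GET. The `sleep` parameter is injectable so the waiting behaviour can be checked without actually pausing.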
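
#### Sketch: recursive link traversal with fragment de-duplication

The note on `get_page_links` in Technical Details explains that URLs differing only in their `#fragment` point to the same page, and that traversal is now recursive with no page visited twice. The sketch below illustrates both points under stated assumptions: `SITE` is a hypothetical in-memory stand-in for fetching a page and reading its links, and `normalize`, `crawl`, and `collect_links` are illustrative names, not the functions in `html_utils.py`.

```python
from urllib.parse import urldefrag, urljoin, urlparse

# Hypothetical site: page URL -> hrefs found on that page.
SITE = {
    "https://abc.com/": ["/abc#XXX", "/abc#YYY", "https://other.com/x"],
    "https://abc.com/abc": ["/", "/def#top"],
    "https://abc.com/def": [],
}


def normalize(base_url, href):
    """Resolve href against base_url and drop the #fragment."""
    return urldefrag(urljoin(base_url, href)).url


def crawl(url, base_netloc, visited):
    """Visit url, then recurse into every unvisited link on the same host."""
    if url in visited:
        return
    visited.add(url)
    for href in SITE.get(url, []):
        link = normalize(url, href)
        if urlparse(link).netloc == base_netloc:
            crawl(link, base_netloc, visited)


def collect_links(start_url):
    """Return the set of de-duplicated internal URLs reachable from start_url."""
    visited = set()
    crawl(start_url, urlparse(start_url).netloc, visited)
    return visited
```

Because `/abc#XXX` and `/abc#YYY` both normalize to `https://abc.com/abc`, the page is recorded once. For a large site, an explicit stack or queue would avoid Python's recursion limit.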
Showing 18 changed files with 136 additions and 130 deletions.