Improve HTML Splitter, URL Fetching Logic & Astro Docs Ingestion #293

Merged

@davidgxue davidgxue commented Feb 2, 2024

Description

  • This is the first part of a two-part effort to improve the scraping, extraction, chunking, and tokenizing logic in Ask Astro's data ingestion process (see issue Research & Implement: Data Ingestion Related Improvements #258 for details).
  • This PR mainly focuses on reducing noise in the ingestion of the Astro Docs data source. Related changes include scraping only the latest doc versions and adding automatic exponential backoff to the HTML fetch function.

Closes the Following Issues

Partially Completes Issues

Technical Details

  • airflow/include/tasks/extract/astro_docs.py
    • Add function process_astro_doc_page_content, which strips noisy, unhelpful content such as the nav bar, header, and footer and extracts only the main article content of the page.
    • Remove the previous function scrape_page, which scraped a page's HTML content and also found and scraped all of its sub-pages through the links it contained. It is removed because: 1. the centralized util function fetch_page_content() already fetches each page's HTML; 2. the centralized util function get_internal_links already finds all of a page's links; 3. the old scraping process did not exclude noisy, unrelated content, which is now handled by process_astro_doc_page_content from the previous bullet.
  • airflow/include/tasks/split.py
    • Modify function split_html. It previously split on specific HTML tags using HTMLHeaderTextSplitter, which is not ideal: we do not want to split that often, and there is no guarantee that splitting on such tags preserves semantic meaning. It now uses RecursiveCharacterTextSplitter with a token limit, which splits ONLY when a chunk exceeds the specified number of tokens. If a piece still exceeds the limit, the splitter moves down the separator list and splits further, ending with splits on spaces and individual characters, until every chunk fits within the token limit. This keeps more semantic meaning in each chunk and enforces the token limit.
  • airflow/include/tasks/extract/utils/html_utils.py
    • Change function fetch_page_content to add automatic retries with exponential backoff using tenacity.
    • Change function get_page_links to traverse a given page recursively and find all links that belong to the website, so that no page is traversed twice and no page is missed. The previous logic missed some links, possibly because it used a for loop instead of traversing recursively until all links were exhausted.
    • Note: This changes the set of fetched URLs considerably. Previously many links looked like https://abc.com/abc#XXX and https://abc.com/abc#YYY, where the fragment after the # points to a section of the same page, but the old logic could not tell that they were the same page.
  • airflow/requirements.txt: adding required packages
  • api/ask_astro/settings.py: remove unused variables
  • All other DAGs: change the batch size, because batches of 1k were hitting the OpenAI rate limit
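
The split-only-when-too-long behaviour described for split_html can be illustrated with a small standard-library sketch. It is not the RecursiveCharacterTextSplitter implementation: `split_recursive` and `count_tokens` are hypothetical names, and whitespace-separated words stand in for model tokens, which a real pipeline would count with an actual tokenizer.

```python
from typing import List, Sequence


def count_tokens(text: str) -> int:
    # Assumption: whitespace-separated words approximate tokens.
    return len(text.split())


def split_recursive(
    text: str,
    max_tokens: int,
    separators: Sequence[str] = ("\n\n", "\n", " "),
) -> List[str]:
    """Split text only where needed so each chunk has <= max_tokens tokens."""
    if max_tokens < 1:
        raise ValueError("max_tokens must be at least 1")
    # Text that already fits is returned as a single chunk (no splitting).
    if count_tokens(text) <= max_tokens:
        return [text.strip()] if text.strip() else []
    # No separators left: fall back to fixed-size groups of words.
    if not separators:
        words = text.split()
        return [
            " ".join(words[i:i + max_tokens])
            for i in range(0, len(words), max_tokens)
        ]

    sep, finer = separators[0], separators[1:]
    chunks: List[str] = []
    current = ""
    for piece in text.split(sep):
        if not piece.strip():
            continue
        candidate = piece if not current else current + sep + piece
        if count_tokens(candidate) <= max_tokens:
            # Keep merging neighbouring pieces while they still fit.
            current = candidate
            continue
        if current:
            chunks.append(current.strip())
            current = ""
        if count_tokens(piece) <= max_tokens:
            current = piece
        else:
            # The piece alone is too long: retry with the next, finer separator.
            chunks.extend(split_recursive(piece, max_tokens, finer))
    if current:
        chunks.append(current.strip())
    return chunks


sample = "Intro line one.\n\nAlpha beta gamma delta epsilon zeta.\nEta theta."
for chunk in split_recursive(sample, max_tokens=4):
    print(repr(chunk))
```

Coarse separators (paragraph breaks) are tried first, so text is only cut inside a paragraph or a line when that paragraph or line is itself over the limit.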

Results

Astro Docs: Better URLs Fetched + Crawling Improvement + HTML Splitter Improvement

  1. Example of formatting and chunking
  • Previously (near unreadable)
    image
  • Now (cleaned!)
    image
  2. Example of URL differences
  • Previously
    • Around 1000 links fetched. Many have DUPLICATE content since they point to the same page.
    • XML and other non-HTML content was fetched as well.
      See old links: astro_docs_links_old.txt
  • Now
    • No more duplicate pages or unreleased pages
    • No older versions of the software docs; only the latest docs are ingested (e.g., the .../0.31... links are gone).
      new_astro_docs_links.txt
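
The drop in duplicate links comes from treating URLs that differ only in their #fragment as one page and from visiting each page once. A minimal sketch of that idea follows, using only the standard library. It is not the code in html_utils.py: `normalize`, `crawl`, `get_links`, and the in-memory `SITE` map are hypothetical, and an explicit queue is used here in place of recursion.

```python
from collections import deque
from typing import Callable, Dict, Iterable, List, Set
from urllib.parse import urldefrag, urljoin, urlparse


def normalize(base_url: str, href: str) -> str:
    """Resolve href against base_url and drop any #fragment."""
    url, _fragment = urldefrag(urljoin(base_url, href))
    return url


def crawl(start_url: str, get_links: Callable[[str], Iterable[str]]) -> Set[str]:
    """Collect every same-domain page reachable from start_url, once each."""
    domain = urlparse(start_url).netloc
    start = normalize(start_url, "")
    seen: Set[str] = {start}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for href in get_links(page):
            url = normalize(page, href)
            # Skip external sites and pages that were already discovered.
            if urlparse(url).netloc != domain or url in seen:
                continue
            seen.add(url)
            queue.append(url)
    return seen


# Hypothetical site map standing in for real HTTP fetches.
SITE: Dict[str, List[str]] = {
    "https://example.com/docs": [
        "/docs/a#intro",
        "/docs/a#usage",
        "https://other.example.org/x",
    ],
    "https://example.com/docs/a": ["/docs", "b"],
    "https://example.com/docs/b": [],
}

pages = crawl("https://example.com/docs", lambda url: SITE.get(url, []))
print(sorted(pages))
```

Here /docs/a#intro and /docs/a#usage collapse into the single page /docs/a, and the link to the other domain is ignored.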

Evaluation

  • General improvement in response quality and in the quality of the retrieved documents, with better quoting from the docs.
  • See a subset of evaluation results
    evaluation_result_improve_1.csv

cloudflare-workers-and-pages bot commented Feb 13, 2024

Deploying with Cloudflare Pages

Latest commit: bbfa510
Status: ✅  Deploy successful!
Preview URL: https://3291af56.ask-astro.pages.dev
Branch Preview URL: https://improve-html-splitter-url-fe.ask-astro.pages.dev

@davidgxue davidgxue self-assigned this Feb 13, 2024
@davidgxue davidgxue marked this pull request as ready for review February 21, 2024 06:37
airflow/requirements.txt (outdated review thread, resolved)
@davidgxue davidgxue merged commit c43ffc1 into main Feb 23, 2024
8 checks passed
@davidgxue davidgxue deleted the improve_html_splitter_url_fetch_and_astro_docs_ingestion branch February 23, 2024 00:27