refactor(airflow): refactor extract_astro_blogs method #176
@Lee-W How have you tested this? Please run the DAG once to test this change.
I ran these functions and compared the results. Let me run the tests on the DAG as well.
69e7e77 to 273e6dc
@sunank200 I've tested with local DAGs.
@Lee-W For each of these ingestion changes, we need to do the following:
Blocking this change until it is tested by @vatsrahul1001.
May I know where the dev database and dev Airflow environment that we can ingest into are? Also, which change did you mention? Do you mean the code change in this PR?
One thing to think about here: the current code uses a source-specific extract. The extract function is specific to Astro blogs. Alternatively, the code at https://github.com/mpgreg/ask-astro/blob/main/airflow/dags/ingestion/ask-astro-load-html.py has one extract function that scrapes any webpage given some parameters. I think it will generalize pretty well, but I have not really tested it. The main problem for Astro blogs is that there is no generalizable way to specify a cutoff date. Perhaps that is okay. We can check with Juliana whether there is some specific reason we don't want to include older blogs.
I guess the reason behind it might be the same as for what we did with StackOverflow? But I'm OK with it as well. I'll hold the work on this PR until we have a decision.
@Lee-W Can you please set up a call for this so we can reach a decision? After that, this change should be tested with a fresh ingestion on the dev database.
I just sent a message to the channel for discussion. But IMHO, this might not even be directly part of the PR.
As discussed earlier, we still need this cutoff date. I will proceed with testing on this one.
I changed the host and pointed it to the dev WEAVIATE. @vatsrahul1001 Could you please help with the testing? Thanks!
@vatsrahul1001 We no longer need to test it on cloud. Thanks! @sunank200 I've tested it and sent you the results. Could you please take a look? Thanks!
LGTM
When parsing links, parse the article tag to get the date to filter. Break the loop if any dates are filtered in an iteration.
closes: #116
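
For readers following the commit description, here is a minimal sketch of the idea (parse each article tag for its date, filter by a cutoff, and stop once an iteration filters anything out). It is not the code from this PR: it uses only the standard library rather than whatever scraping library the DAG relies on, and the names `ArticleLinkParser`, `collect_recent_links`, and `fetch_page`, as well as the assumed markup (one `<a href>` and one `<time datetime>` per `<article>`), are hypothetical.

```python
from datetime import datetime, timezone
from html.parser import HTMLParser
from typing import Callable, List, Optional, Tuple


class ArticleLinkParser(HTMLParser):
    """Collect (href, published) pairs from <article> elements.

    Assumption: each <article> contains one <a href="..."> and one
    <time datetime="ISO-8601">; the real blog markup may differ.
    """

    def __init__(self) -> None:
        super().__init__()
        self.entries: List[Tuple[str, datetime]] = []
        self._in_article = False
        self._href: Optional[str] = None
        self._date: Optional[datetime] = None

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "article":
            self._in_article, self._href, self._date = True, None, None
        elif self._in_article and tag == "a" and self._href is None:
            self._href = attr.get("href")
        elif self._in_article and tag == "time" and attr.get("datetime"):
            published = datetime.fromisoformat(attr["datetime"])
            if published.tzinfo is None:
                # Treat naive timestamps as UTC so comparisons stay valid.
                published = published.replace(tzinfo=timezone.utc)
            self._date = published

    def handle_endtag(self, tag):
        if tag == "article" and self._in_article:
            if self._href and self._date:
                self.entries.append((self._href, self._date))
            self._in_article = False


def collect_recent_links(
    fetch_page: Callable[[int], str],
    cutoff: datetime,
    max_pages: int = 100,
) -> List[str]:
    """Return links published on or after `cutoff`.

    `fetch_page` is a hypothetical callable returning the listing HTML for
    a 1-based page number. Listings are assumed to be newest-first, so the
    loop stops after the first page on which any article is filtered out.
    """
    links: List[str] = []
    for page in range(1, max_pages + 1):
        parser = ArticleLinkParser()
        parser.feed(fetch_page(page))
        if not parser.entries:
            break  # no articles on this page: nothing more to read
        recent = [href for href, published in parser.entries if published >= cutoff]
        links.extend(recent)
        if len(recent) < len(parser.entries):
            break  # something was filtered, so later pages are older
    return links
```

The early break only makes sense if the listing is sorted newest-first; if it is not, every page would have to be scanned and filtered instead.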