Need to consolidate to a single HTML ingest #164

Closed
11 tasks
mpgreg opened this issue Nov 21, 2023 · 5 comments · Fixed by #237

@mpgreg
Contributor

mpgreg commented Nov 21, 2023

Please describe the feature you'd like to see
Multiple extract functions use almost identical HTML extract logic.

Describe the solution you'd like
These should be consolidated into a single function if possible, using dynamic task mapping as the GitHub extract does.

Are there any alternatives to this feature?

Additional context

Acceptance Criteria

  • All checks and tests in the CI should pass
  • Unit tests
  • Integration tests (if the feature relates to a new database or external service)
  • Example DAG
  • Docstrings in reStructuredText for each of methods, classes, functions and module-level attributes (including Example DAG on how it should be used)
  • Exception handling in case of errors
  • Logging (are we exposing useful information to the user? e.g. source and destination)
  • Improve the documentation (README, Sphinx, and any other relevant)
  • How to use Guide for the feature (example)

Note:

  • After the implementation is complete, the data should be ingested into the dev database and the dev Slackbot should be deployed.
  • This change should be tested by @vatsrahul1001; once he has assessed the quality of the responses, it should be merged.
@sunank200
Collaborator

Note:

  • After the implementation is complete, the data should be ingested into the dev database and the dev Slackbot should be deployed.
  • This change should be tested by @vatsrahul1001; once he has assessed the quality of the responses, it should be merged.

phanikumv changed the title from "Need to consolidate to a single HTMl ingest" to "Need to consolidate to a single HTML ingest" on Dec 8, 2023
@sunank200
Collaborator

Discussed with @pankajastro - a single HTML extractor makes sense.

@pankajastro
Collaborator

pankajastro commented Dec 14, 2023

Just adding more here: a single extractor makes sense, but we still need a thin layer over it for the different sources, because each source needs a different cleanup approach. For example, for Astro SDK I exclude docs URLs that contain "autoapi", "genindex.html", "py-modindex.html", ".md", or ".py", but for providers I exclude "_api", "_modules", "_sources", "changelog.html", "genindex.html", "py-modindex.html", and "#". So at least a separate task per source makes sense to me, and if we keep a separate DAG per source it may be easier to run, i.e. we can upsert only the source we want.
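
A minimal sketch of the "thin layer" idea above, not the code from the eventual PR: one shared filter function plus a small per-source config that carries the exclusion list. The names `SourceConfig`, `ASTRO_SDK`, `PROVIDERS`, and `filter_doc_urls` are hypothetical, and plain substring matching on the URL is an assumption about how the exclusions are applied; only the exclusion strings themselves come from the comment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SourceConfig:
    """Hypothetical per-source settings for the shared HTML extractor."""

    name: str
    exclude_substrings: tuple


# Exclusion strings taken from the comment above; the config names are illustrative.
ASTRO_SDK = SourceConfig(
    name="astro-sdk",
    exclude_substrings=("autoapi", "genindex.html", "py-modindex.html", ".md", ".py"),
)
PROVIDERS = SourceConfig(
    name="providers",
    exclude_substrings=(
        "_api",
        "_modules",
        "_sources",
        "changelog.html",
        "genindex.html",
        "py-modindex.html",
        "#",
    ),
)


def filter_doc_urls(urls, config):
    """Shared logic: keep only URLs that contain none of the source's excluded substrings."""
    return [
        url
        for url in urls
        if not any(fragment in url for fragment in config.exclude_substrings)
    ]
```

With this shape, each source's task (or DAG) would call the same function with its own config, so a single source can be re-run and upserted on its own. Note that substring matching is coarse: a fragment such as ".py" would also match a host name like "docs.python.org", so a real implementation might match on the URL path or suffix instead.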

@phanikumv
Collaborator

Created draft PR on this one

@pankajastro
Collaborator

just marked PR as ready for review, would appreciate a review

pankajastro added a commit that referenced this issue Jan 11, 2024
closes: #164

Currently, we have some duplicate code in the HTML extractor. This PR
aims to remove the duplicate code and reuse it from html_utils.
sunank200 pushed a commit that referenced this issue Jan 16, 2024
closes: #164

Currently, we have some duplicate code in the HTML extractor. This PR
aims to remove the duplicate code and reuse it from html_utils.
4 participants