[Feature] Web2Parquet should expose all options supported by dpk_connector #876

Open
2 tasks done
sujee opened this issue Dec 13, 2024 · 2 comments
Labels
enhancement New feature or request

Comments

@sujee
Contributor

sujee commented Dec 13, 2024

Search before asking

  • I searched the issues and found no similar issues.

Component

Transforms/Other

Feature

dpk_connector supports limiting a crawl to a domain or path:

https://github.com/IBM/data-prep-kit/blob/dev/data-connector-lib/doc/overview.md

Domain and path focus: You can limit domains and paths accessed by the library.

These parameters should also be exposed through the simpler API, dpk_web2parquet.transform.Web2Parquet.

This matters because it lets users keep a crawl within a single domain, so the crawler doesn't follow links out to other domains.

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!
@sujee sujee added the enhancement New feature or request label Dec 13, 2024
@sujee
Contributor Author

sujee commented Dec 13, 2024

Right now, we extract only a fixed set of parameters from the config dictionary:

Crawl the web and load content to pyarrow Table.

    self.seed_urls = config.get("urls", [])
    self.depth = config.get("depth", 1)
    self.downloads = config.get("downloads", 10)
    self.allow_mime_types = config.get("mime_types", ["application/pdf", "text/html", "text/markdown", "text/plain"])
    self.folder = config.get("folder", None)

Instead, passing the config dictionary through to dpk_connector.crawl might be a better option:

  • this way we don't have to keep updating the dpk_web2parquet transform to stay in sync with dpk_connector
  • users can specify as few or as many parameters as they like, which gives them both ease of use and control
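
A minimal sketch of the pass-through idea, not the actual Web2Parquet code. `fake_crawl` is a hypothetical stand-in for dpk_connector.crawl, and its keyword names (`depth_limit`, `download_limit`, `allow_mime_types`, `path_focus`, `subdomain_focus`) are assumptions that would need to be checked against the library. The transform keeps its short aliases, consumes `folder` itself, and forwards everything else unchanged:

```python
from typing import Any, Callable, Dict


def fake_crawl(seed_urls, on_downloaded, *, depth_limit=1, download_limit=10,
               allow_mime_types=("text/html",), path_focus=False,
               subdomain_focus=False) -> Dict[str, Any]:
    """Hypothetical stand-in for dpk_connector.crawl: returns what it received."""
    return {
        "seed_urls": list(seed_urls),
        "depth_limit": depth_limit,
        "download_limit": download_limit,
        "allow_mime_types": list(allow_mime_types),
        "path_focus": path_focus,
        "subdomain_focus": subdomain_focus,
    }


class Web2ParquetSketch:
    # Short names used by the transform today, mapped to assumed crawl keywords.
    RENAMES = {
        "urls": "seed_urls",
        "depth": "depth_limit",
        "downloads": "download_limit",
        "mime_types": "allow_mime_types",
    }
    # Keys the transform handles itself and must not forward to the crawler.
    LOCAL_ONLY = {"folder"}

    def __init__(self, **config: Any) -> None:
        self.folder = config.get("folder")
        # Everything that is not local-only is forwarded, renamed where needed.
        self.crawl_kwargs = {
            self.RENAMES.get(key, key): value
            for key, value in config.items()
            if key not in self.LOCAL_ONLY
        }
        self.crawl_kwargs.setdefault("seed_urls", [])

    def _on_downloaded(self, url: str, body: bytes, headers: dict) -> None:
        """Placeholder callback; the real transform would collect content here."""

    def transform(self, crawl: Callable[..., Any] = fake_crawl) -> Any:
        # Unknown keys surface as a TypeError from the crawl function itself.
        return crawl(on_downloaded=self._on_downloaded, **self.crawl_kwargs)
```

With this shape, a call such as `Web2ParquetSketch(urls=[...], depth=2, path_focus=True).transform()` reaches the crawler without the transform knowing about `path_focus` at all. The trade-off is that any option dpk_connector accepts becomes part of the transform's public surface.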

@touma-I
Collaborator

touma-I commented Dec 18, 2024

Hi @sujee, I think there are two things in this issue:
1- I am not a big fan of exposing all the inner interface parameters to the end user. We do not have enough experience yet to know what works and what breaks, so, at least for the short term, I think we should add parameters as needed. The turnaround should not be that long once we have a requirement to expose a new feature, and it gives us a chance to test the capabilities incrementally as we open up more functionality.
2- As for Domain and Path focus, I thought they were set to true by default. Do you have a test script that I can use to confirm one way or the other? In any case, I would be OK with a PR that exposes those two flags.
Thanks
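
The incremental approach from point 1 could be sketched as an explicit allowlist: only named extra options are forwarded, and anything else is rejected early. The flag names `path_focus` and `subdomain_focus` are assumptions standing in for "those two flags", and `select_crawl_options` is a hypothetical helper, not part of either package:

```python
# Options the transform already reads from its config today.
EXISTING_KEYS = frozenset({"urls", "depth", "downloads", "mime_types", "folder"})

# Extra options deliberately exposed so far (assumed flag names).
FORWARDED_KEYS = frozenset({"path_focus", "subdomain_focus"})


def select_crawl_options(config: dict) -> dict:
    """Return only the allowlisted extra options found in config.

    Raises ValueError for keys that are neither existing nor allowlisted,
    so a typo or a not-yet-supported option fails loudly.
    """
    unknown = set(config) - EXISTING_KEYS - FORWARDED_KEYS
    if unknown:
        raise ValueError(f"unsupported option(s): {sorted(unknown)}")
    return {key: config[key] for key in FORWARDED_KEYS if key in config}
```

Exposing a new capability then means adding one name to `FORWARDED_KEYS` along with a test for it, which keeps the public surface small while the team gains experience with each option.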
