
feat: Improve project bootstrapping #538

Merged
merged 19 commits into master from improve-templates on Dec 2, 2024
Conversation


@janbuchar janbuchar commented Sep 20, 2024

This adds a unified crawlee/project_template template. The original playwright and beautifulsoup templates are kept for compatibility with older versions of the CLI.

The user is now prompted for package manager type (pip, poetry), crawler type, start URL and whether or not Apify integration should be set up.

TODO

  • http client selection
  • disable poetry option if it isn't installed
  • rectify the pip-based setup
    1. manual dependency installation - no automatic installation, just dump requirements.txt and tell the user to handle it any way they want
    2. pip+venv - dump requirements.txt, make a virtualenv (.venv) using the current python interpreter, install requirements and tell user to activate it
    - ~~should be disabled if venv module is not present~~ (it's stdlib, so always available)
  • test the whole thing on Windows (mainly the various package manager configurations)
  • fix how cookiecutter.json is read (it is not present when installing via pip)
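The pip+venv option described above can be sketched with nothing but the standard library. This is a minimal illustration, not the actual CLI implementation; `bootstrap_pip_venv` and its parameters are hypothetical names:

```python
import subprocess
import sys
import venv
from pathlib import Path


def bootstrap_pip_venv(project_dir: Path, dependencies: list[str]) -> None:
    """Dump requirements.txt, create .venv with the current interpreter, install into it."""
    requirements = project_dir / "requirements.txt"
    requirements.write_text("\n".join(dependencies) + "\n")

    # Create the virtualenv inside the project directory, with pip bootstrapped.
    venv_dir = project_dir / ".venv"
    venv.EnvBuilder(with_pip=True).create(venv_dir)

    # The interpreter lives in Scripts/ on Windows and bin/ elsewhere.
    bin_dir = "Scripts" if sys.platform == "win32" else "bin"
    python = venv_dir / bin_dir / "python"
    subprocess.check_call([str(python), "-m", "pip", "install", "-r", str(requirements)])

    print(f"Virtual environment ready. Activate it with: {venv_dir / bin_dir / 'activate'}")
```

Running pip through the venv's own interpreter (rather than the one executing the CLI) is what keeps this working under `pipx run`, where the CLI's interpreter lives in a throwaway environment.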

@janbuchar janbuchar added the t-tooling label (Issues with this label are in the ownership of the tooling team.) Sep 20, 2024
@janbuchar janbuchar requested review from vdusek and B4nan September 20, 2024 09:12
@github-actions github-actions bot added this to the 98th sprint - Tooling team milestone Sep 20, 2024

@github-actions github-actions bot left a comment


⚠️ Pull Request Toolkit has failed!

Pull request is neither linked to an issue or epic nor labeled as adhoc!

@github-actions github-actions bot added the tested label (Temporary label used only programmatically for some analytics.) Sep 20, 2024


janbuchar commented Sep 20, 2024

Okay, this could use some testing now... plus there's a bunch of stuff to consider:

  • the pip thing is probably not very ergonomic and I'm not sure how to approach it well... should we create a .venv for users? should we have a requirements.txt and use it as a lockfile?
  • does this work on windows?
  • isn't the creation dialog too long now? maybe some of the options don't need the prompt and just CLI flags are enough
  • should we include http client selection?
  • we should probably scream and shout if poetry is not installed... or we could just disable the option in that case

@janbuchar janbuchar marked this pull request as ready for review September 20, 2024 12:33

@vdusek vdusek left a comment


Nice, just a few comments

Review comments on:
  • templates/crawler/cookiecutter.json (outdated, resolved)
  • templates/crawler/{{cookiecutter.project_name}}/README.md (outdated, resolved)
  • src/crawlee/_cli.py (resolved)
@janbuchar

Thanks for the review @vdusek! Do you have anything to add regarding those points for consideration?


vdusek commented Sep 25, 2024

the pip thing is probably not very ergonomic and I'm not sure how to approach it well... should we create a .venv for users? should we have a requirements.txt and use it as a lockfile?

That is generally how pip works. I wouldn't add additional overhead here. Simply dumping the dependencies into requirements.txt should be enough. Managing the environment and installation should be left to the user - that's the standard approach with pip.

isn't the creation dialog too long now? maybe some of the options don't need the prompt and just CLI flags are enough

IMO the current setup is fine. Also, all options have default values.

should we include http client selection?

That would be a nice feature, but I wouldn't prioritize it at this moment. We have a guide covering that, which should be sufficient for now.

we should probably scream and shout if poetry is not installed... or we could just disable the option in that case

I would suggest throwing an error. If a user wants to use Poetry, the option should be conditional on Poetry being installed.
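The availability check suggested here can be as simple as probing PATH before offering or accepting the option. A sketch; `validate_package_manager` is a hypothetical helper name, not the actual CLI code:

```python
import shutil


def package_manager_available(name: str) -> bool:
    """Return True if the given package manager executable is on PATH."""
    return shutil.which(name) is not None


def validate_package_manager(choice: str) -> None:
    """Raise if the chosen package manager is not usable on this machine."""
    if choice == "poetry" and not package_manager_available("poetry"):
        raise RuntimeError(
            "You selected Poetry, but the `poetry` executable was not found. "
            "Install it from https://python-poetry.org or choose pip instead."
        )
```

The same predicate can drive both behaviors discussed above: filtering the choice out of the interactive prompt, and erroring out when the choice arrives via a CLI flag.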


@vdusek vdusek left a comment


The pip version does not work.

@janbuchar

the pip thing is probably not very ergonomic and I'm not sure how to approach it well... should we create a .venv for users? should we have a requirements.txt and use it as a lockfile?

That is generally how pip works. I wouldn't add additional overhead here. Simply dumping the dependencies into requirements.txt should be enough. Managing the environment and installation should be left to the user - that's the standard approach with pip.

I generally agree. But how would you ensure that pipx run crawlee create ... installs the dependencies in a venv? Often people will want to keep the virtualenv in the project directory, which doesn't get created until you run crawlee create.

should we include http client selection?

That would be a nice feature, but I wouldn't prioritize it at this moment. We have a guide covering that, which should be sufficient for now.

It's a 15-minute adventure, the only problem is that it'd make the dialog longer.

@vdusek vdusek changed the title feat: Improved project bootstrapping feat: Improve project bootstrapping Oct 31, 2024
@janbuchar janbuchar marked this pull request as draft November 21, 2024 12:55
Review comments on src/crawlee/_cli.py (outdated, resolved)
@janbuchar janbuchar requested a review from vdusek December 2, 2024 13:38
@janbuchar janbuchar marked this pull request as ready for review December 2, 2024 13:38

@vdusek vdusek left a comment


We provide the HTTP client option together with the PW crawler; is that what we want?

vdusek and janbuchar exchanged comments that were later marked as resolved.
@janbuchar janbuchar requested a review from vdusek December 2, 2024 14:46
@janbuchar

We provide the HTTP client option together with the PW crawler; is that what we want?

As long as we don't use playwright.request in PlaywrightCrawlingContext.send_request, it makes perfect sense. After that, it still might make sense in some niche cases, but the utility of giving that choice to users in project bootstrapping will be debatable.


vdusek commented Dec 2, 2024

As long as we don't use playwright.request in PlaywrightCrawlingContext.send_request, it makes perfect sense. After that, it still might make sense in some niche cases, but the utility of giving that choice to users in project bootstrapping will be debatable.

I would not expose the option to the user; IMO it would only result in confusion. Not a request, though.


@vdusek vdusek left a comment


I tested a lot of variants, seems to be working, good job!

@janbuchar janbuchar merged commit 367899c into master Dec 2, 2024
23 checks passed
@janbuchar janbuchar deleted the improve-templates branch December 2, 2024 15:22
Mantisus pushed a commit to Mantisus/crawlee-python that referenced this pull request Dec 10, 2024
This adds a unified `crawlee/project_template` template. The original
`playwright` and `beautifulsoup` templates are kept for compatibility
with older versions of the CLI.

The user is now prompted for package manager type (pip, poetry), crawler
type, start URL and whether or not Apify integration should be set up.

- closes apify#317
- closes apify#414 (http client selection is not implemented)
- closes apify#511
- closes apify#495

### TODO

- [x] http client selection
- [x] disable poetry option if it isn't installed
- [x] rectify the pip-based setup
1. **manual dependency installation** - no automatic installation, just
dump requirements.txt and tell the user to handle it any way they want
2. **pip+venv** - dump requirements.txt, make a virtualenv (.venv) using
the current python interpreter, install requirements and tell user to
activate it
- ~should be disabled if `venv` module is not present~ it's stdlib
- [x] test the whole thing on Windows (mainly the various package
manager configurations)
- [x] fix how cookiecutter.json is read (it is not present when
installing via pip)
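The last TODO item, reading `cookiecutter.json` when the package is installed via pip, is the kind of problem `importlib.resources` is designed for: reading package data works the same from a source checkout and from an installed wheel. A sketch under the assumption that the template ships as package data; the function and parameter names are illustrative, not the actual crawlee code:

```python
import json
from importlib.resources import files


def load_cookiecutter_config(package: str, resource: str = "cookiecutter.json") -> dict:
    """Load a template's cookiecutter.json shipped as package data.

    Going through importlib.resources (rather than building a filesystem path
    relative to __file__) works regardless of how the package was installed.
    """
    text = files(package).joinpath(resource).read_text(encoding="utf-8")
    return json.loads(text)
```

For this to work, the template directory has to be declared as package data in the build configuration (e.g. `include` / `package-data` settings), which is the usual companion fix when a file "is not present when installing via pip".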