Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

refactor: update Crawler to support selenium>=4.11.0 and simplify it #5515

Merged
merged 4 commits into from
Aug 8, 2023

Conversation

anakin87
Copy link
Member

@anakin87 anakin87 commented Aug 4, 2023

Related Issues

Proposed Changes:

With the recent release of Selenium, the Crawler started failing both locally and in Colab.
Thanks to changes recently introduced in Selenium, we can get rid of the webdriver-manager dependency:
now Selenium automatically finds or installs the required drivers.

  • I pinned selenium>=4.11.0 (and removed webdriver-manager)
  • I changed the crawler initialization to make it work
  • I took the opportunity to refactor and simplify the code

How did you test it?

Existing tests (CI), manual tests on Ubuntu and in Colab.

Checklist

@anakin87 anakin87 requested a review from a team as a code owner August 4, 2023 15:55
@anakin87 anakin87 requested review from dfokina and removed request for a team August 4, 2023 15:55
@coveralls
Copy link
Collaborator

Pull Request Test Coverage Report for Build 5764270194

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • 81 unchanged lines in 1 file lost coverage.
  • Overall coverage increased (+0.003%) to 46.887%

Files with Coverage Reduction New Missed Lines %
nodes/connector/crawler.py 81 37.64%
Totals Coverage Status
Change from base Build 5764030079: 0.003%
Covered Lines: 10891
Relevant Lines: 23228

💛 - Coveralls

@agnieszka-m agnieszka-m requested review from agnieszka-m and removed request for dfokina August 8, 2023 09:55
@vblagoje
Copy link
Member

vblagoje commented Aug 8, 2023

PR Analysis

  • 🎯 Main theme: Refactoring the Crawler to support selenium>=4.11.0 and simplifying the code
  • 📌 Type of PR: Refactoring
  • 🏅 Score: 90
  • 🧪 Relevant tests added: Yes
  • Focused PR: Yes, the PR has a clear and coherent title and description, and all PR code diff changes are properly derived from the title and description.
  • 🔒 Security concerns: No, the PR does not introduce any apparent security concerns. The changes are mainly related to the update of a dependency and the simplification of the code, which do not involve any security-sensitive operations.

PR Feedback

  • 💡 General PR suggestions: The PR is well-structured and the changes are well-documented. The removal of the 'webdriver-manager' dependency and the simplification of the Crawler initialization are good improvements. However, it would be beneficial to provide more detailed information on the manual tests conducted on Ubuntu and in Colab, to ensure the changes work as expected in all environments.

Copy link
Member

@vblagoje vblagoje left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM @anakin87, let's leave a link to your Colab on this PR

@anakin87
Copy link
Member Author

anakin87 commented Aug 8, 2023

Colab notebook: https://colab.research.google.com/drive/1qvnKmjLt_OdloVo56QjMcQatYtCqVSsk?usp=sharing

@anakin87 anakin87 merged commit 3f47299 into main Aug 8, 2023
58 checks passed
@anakin87 anakin87 deleted the update_crawler_selenium_4.11 branch August 8, 2023 13:13
DosticJelena pushed a commit to smartcat-labs/haystack that referenced this pull request Aug 23, 2023
…eepset-ai#5515)

* refactor crawler

* rm unused imports

* release notes!

* rm outdated mock
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Crawler does not work with selenium>=4.11.0
4 participants